Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Large-scale machine learning for metagenomics sequence classification

Identifieur interne : 000085 ( Pmc/Corpus ); précédent : 000084; suivant : 000086

Large-scale machine learning for metagenomics sequence classification

Auteurs : Kévin Vervier ; Pierre Mahé ; Maud Tournoud ; Jean-Baptiste Veyrieras ; Jean-Philippe Vert

Source :

RBID : PMC:4896366

Abstract

Motivation: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.

Results: We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genome to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 108 samples in 107 dimensions, which is out of reach of standard softwares but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2–17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.

Availability and implementation: Data and codes are available at http://cbio.ensmp.fr/largescalemetagenomics.

Contact:pierre.mahe@biomerieux.com

Supplementary information:Supplementary data are available at Bioinformatics online.


Url:
DOI: 10.1093/bioinformatics/btv683
PubMed: 26589281
PubMed Central: 4896366

Links to Exploration step

PMC:4896366

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Large-scale machine learning for metagenomics sequence classification</title>
<author>
<name sortKey="Vervier, Kevin" sort="Vervier, Kevin" uniqKey="Vervier K" first="Kévin" last="Vervier">Kévin Vervier</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF2"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF3"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF4"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mahe, Pierre" sort="Mahe, Pierre" uniqKey="Mahe P" first="Pierre" last="Mahé">Pierre Mahé</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tournoud, Maud" sort="Tournoud, Maud" uniqKey="Tournoud M" first="Maud" last="Tournoud">Maud Tournoud</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Veyrieras, Jean Baptiste" sort="Veyrieras, Jean Baptiste" uniqKey="Veyrieras J" first="Jean-Baptiste" last="Veyrieras">Jean-Baptiste Veyrieras</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vert, Jean Philippe" sort="Vert, Jean Philippe" uniqKey="Vert J" first="Jean-Philippe" last="Vert">Jean-Philippe Vert</name>
<affiliation>
<nlm:aff id="btv683-AFF2"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF3"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF4"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26589281</idno>
<idno type="pmc">4896366</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4896366</idno>
<idno type="RBID">PMC:4896366</idno>
<idno type="doi">10.1093/bioinformatics/btv683</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000085</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000085</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Large-scale machine learning for metagenomics sequence classification</title>
<author>
<name sortKey="Vervier, Kevin" sort="Vervier, Kevin" uniqKey="Vervier K" first="Kévin" last="Vervier">Kévin Vervier</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF2"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF3"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF4"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mahe, Pierre" sort="Mahe, Pierre" uniqKey="Mahe P" first="Pierre" last="Mahé">Pierre Mahé</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tournoud, Maud" sort="Tournoud, Maud" uniqKey="Tournoud M" first="Maud" last="Tournoud">Maud Tournoud</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Veyrieras, Jean Baptiste" sort="Veyrieras, Jean Baptiste" uniqKey="Veyrieras J" first="Jean-Baptiste" last="Veyrieras">Jean-Baptiste Veyrieras</name>
<affiliation>
<nlm:aff id="btv683-AFF1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vert, Jean Philippe" sort="Vert, Jean Philippe" uniqKey="Vert J" first="Jean-Philippe" last="Vert">Jean-Philippe Vert</name>
<affiliation>
<nlm:aff id="btv683-AFF2"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF3"></nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="btv683-AFF4"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the
<italic>k</italic>
-mers it contains have the potential to provide faster solutions.</p>
<p>
<bold>Results:</bold>
We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genome to tune its parameters, up to a coverage of about 10, and from increasing the
<italic>k</italic>
-mer size to about 12. Tuning the method involves training machine learning models on about 10
<sup>8</sup>
samples in 10
<sup>7</sup>
dimensions, which is out of reach of standard softwares but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2–17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.</p>
<p>
<bold>Availability and implementation:</bold>
Data and codes are available at
<ext-link ext-link-type="uri" xlink:href="http://cbio.ensmp.fr/largescalemetagenomics">http://cbio.ensmp.fr/largescalemetagenomics</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>pierre.mahe@biomerieux.com</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Agarwal, A" uniqKey="Agarwal A">A. Agarwal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Angly, F" uniqKey="Angly F">F. Angly</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Balzer, S" uniqKey="Balzer S">S. Balzer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beygelzimer, A" uniqKey="Beygelzimer A">A. Beygelzimer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bottou, L" uniqKey="Bottou L">L. Bottou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bottou, L" uniqKey="Bottou L">L. Bottou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gammerman, A" uniqKey="Gammerman A">A. Gammerman</name>
</author>
<author>
<name sortKey="Vovk, V" uniqKey="Vovk V">V. Vovk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P. Hugenholtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, D" uniqKey="Huson D">D. Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Korbel, J" uniqKey="Korbel J">J. Korbel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D. Koslicki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langford, J" uniqKey="Langford J">J. Langford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R. Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lindner, M S" uniqKey="Lindner M">M.S. Lindner</name>
</author>
<author>
<name sortKey="Renard, B Y" uniqKey="Renard B">B.Y. Renard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lukjancenko, O" uniqKey="Lukjancenko O">O. Lukjancenko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mande, S" uniqKey="Mande S">S. Mande</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martin, J" uniqKey="Martin J">J. Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mchardy, A C" uniqKey="Mchardy A">A.C. McHardy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, R" uniqKey="Miller R">R. Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parks, D" uniqKey="Parks D">D. Parks</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patil, K" uniqKey="Patil K">K. Patil</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peterson, J" uniqKey="Peterson J">J. Peterson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pruitt, K" uniqKey="Pruitt K">K. Pruitt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Riesenfeld, C" uniqKey="Riesenfeld C">C. Riesenfeld</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G. Rosen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmieder, R" uniqKey="Schmieder R">R. Schmieder</name>
</author>
<author>
<name sortKey="Edwards, R" uniqKey="Edwards R">R. Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S. Sonnenburg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Soon, W" uniqKey="Soon W">W. Soon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, D E" uniqKey="Wood D">D.E. Wood</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26589281</article-id>
<article-id pub-id-type="pmc">4896366</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btv683</article-id>
<article-id pub-id-type="publisher-id">btv683</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Papers</subject>
<subj-group subj-group-type="heading">
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Large-scale machine learning for metagenomics sequence classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Vervier</surname>
<given-names>Kévin</given-names>
</name>
<xref ref-type="aff" rid="btv683-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="btv683-AFF2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="btv683-AFF3">
<sup>3</sup>
</xref>
<xref ref-type="aff" rid="btv683-AFF4">
<sup>4</sup>
</xref>
<xref ref-type="author-notes" rid="btv683-FN1">
<sup></sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Mahé</surname>
<given-names>Pierre</given-names>
</name>
<xref ref-type="aff" rid="btv683-AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="btv683-COR1">*</xref>
<xref ref-type="author-notes" rid="btv683-FN1">
<sup></sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tournoud</surname>
<given-names>Maud</given-names>
</name>
<xref ref-type="aff" rid="btv683-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Veyrieras</surname>
<given-names>Jean-Baptiste</given-names>
</name>
<xref ref-type="aff" rid="btv683-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vert</surname>
<given-names>Jean-Philippe</given-names>
</name>
<xref ref-type="aff" rid="btv683-AFF2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="btv683-AFF3">
<sup>3</sup>
</xref>
<xref ref-type="aff" rid="btv683-AFF4">
<sup>4</sup>
</xref>
</contrib>
<aff id="btv683-AFF1">
<sup>1</sup>
</aff>
<aff id="btv683-AFF2">
<sup>2</sup>
</aff>
<aff id="btv683-AFF3">
<sup>3</sup>
</aff>
<aff id="btv683-AFF4">
<sup>4</sup>
</aff>
</contrib-group>
<author-notes>
<corresp id="btv683-COR1">*To whom correspondence should be addressed.</corresp>
<fn id="btv683-FN1">
<p>
<sup></sup>
The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.</p>
</fn>
<fn id="FN1">
<p>Associate Editor: Inanc Birol</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>4</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>20</day>
<month>11</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>20</day>
<month>11</month>
<year>2015</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>32</volume>
<issue>7</issue>
<fpage>1023</fpage>
<lpage>1032</lpage>
<history>
<date date-type="received">
<day>4</day>
<month>6</month>
<year>2015</year>
</date>
<date date-type="rev-recd">
<day>26</day>
<month>10</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>11</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2015. Published by Oxford University Press.</copyright-statement>
<copyright-year>2015</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by-nc/4.0/" license-type="creative-commons">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the
<italic>k</italic>
-mers it contains have the potential to provide faster solutions.</p>
<p>
<bold>Results:</bold>
We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genome to tune its parameters, up to a coverage of about 10, and from increasing the
<italic>k</italic>
-mer size to about 12. Tuning the method involves training machine learning models on about 10
<sup>8</sup>
samples in 10
<sup>7</sup>
dimensions, which is out of reach of standard softwares but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2–17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.</p>
<p>
<bold>Availability and implementation:</bold>
Data and codes are available at
<ext-link ext-link-type="uri" xlink:href="http://cbio.ensmp.fr/largescalemetagenomics">http://cbio.ensmp.fr/largescalemetagenomics</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>pierre.mahe@biomerieux.com</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</abstract>
<counts>
<page-count count="10"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<p>Recent progress in next-generation sequencing technologies allow to access large amounts of genomic data within a few hours at a reasonable cost (
<xref rid="btv683-B31" ref-type="bibr">Soon
<italic>et</italic>
<italic>al.</italic>
, 2013</xref>
). In metagenomics, next-generation sequencing is used to analyze the genomic content of microbial communities by sequencing all DNA present in an environmental sample (
<xref rid="btv683-B27" ref-type="bibr">Riesenfeld
<italic>et</italic>
<italic>al.</italic>
, 2004</xref>
). It gives access to all organisms present in the sample even if they do not grow on culture media (
<xref rid="btv683-B8" ref-type="bibr">Hugenholtz
<italic>et</italic>
<italic>al.</italic>
, 2002</xref>
) and allows us to characterize with an unprecedented level of resolution the diversity of the microbial realm (
<xref rid="btv683-B24" ref-type="bibr">Peterson
<italic>et</italic>
<italic>al.</italic>
, 2009</xref>
).</p>
<p>The raw output of a metagenomics experiment is a large set of short DNA sequences (reads) obtained by high-throughput sequencing of the DNA present in the sample. There exist two main approaches to analyze these data, corresponding to slightly different goals. On the one hand,
<italic>taxonomic profiling</italic>
aims to estimate the relative abundance of the members of the microbial community, without necessarily assigning each read to a taxonomic class. Recent works like WGSQuikr (
<xref rid="btv683-B11" ref-type="bibr">Koslicki
<italic>et</italic>
<italic>al.</italic>
, 2014</xref>
) or GASIC (
<xref rid="btv683-B15" ref-type="bibr">Lindner and Renard, 2012</xref>
) proved to be very efficient for this purpose.
<italic>Taxonomic binning</italic>
methods, on the other hand, explicitly assign each read to a taxonomic clade. This process can be unsupervised, relying on clustering methods to assign reads to several bins (i.e. clusters), or supervised, in which case reads are individually assigned to nodes of the taxonomy (
<xref rid="btv683-B17" ref-type="bibr">Mande
<italic>et</italic>
<italic>al.</italic>
, 2012</xref>
). While binning is arguably more challenging than profiling, it is a necessary step for downstream applications which require draft-genome reconstruction. This may notably be the case in a diagnostics context, where further analyses could aim to detect pathogen microorganisms (
<xref rid="btv683-B21" ref-type="bibr">Miller
<italic>et</italic>
<italic>al.</italic>
, 2013</xref>
) or antibiotic resistance mechanisms (
<xref rid="btv683-B29" ref-type="bibr">Schmieder and Edwards, 2012</xref>
).</p>
<p>In this article, we focus on the problem of supervised taxonomic binning, where we wish to assign each read in a metagenomics sample to a node of a pre-defined taxonomy. Two main computational strategies have been proposed for that purpose: (i) alignment-based approaches, where the read is searched against a reference sequence database with sequence alignment tools like BLAST (
<xref rid="btv683-B9" ref-type="bibr">Huson
<italic>et</italic>
<italic>al.</italic>
, 2007</xref>
) or short read mapping tools (e.g. BWA,
<xref rid="btv683-B14" ref-type="bibr">Li and Durbin, 2009</xref>
) and (ii) compositional approaches, where a machine learning model such as a naive Bayes (NB) classifier (
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et</italic>
<italic>al</italic>
., 2011</xref>
;
<xref rid="btv683-B32" ref-type="bibr">Wang
<italic>et</italic>
<italic>al</italic>
., 2007</xref>
) or a support vector machine (SVM,
<xref rid="btv683-B20" ref-type="bibr">McHardy
<italic>et</italic>
<italic>al.</italic>
, 2007</xref>
;
<xref rid="btv683-B23" ref-type="bibr">Patil
<italic>et</italic>
<italic>al.</italic>
, 2012</xref>
) is trained to label the read based on the set of
<italic>k</italic>
-mers it contains. Recently, a very fast compositional approach using long
<italic>k</italic>
-mers and not based on machine learning models, called Kraken (
<xref rid="btv683-B33" ref-type="bibr">Wood and Salzberg, 2014</xref>
), has also been proposed. Since the taxonomic classification of a sequence by compositional approaches is only based on the set of
<italic>k</italic>
-mers it contains, they can offer significant gain in terms of classification time over similarity-based approaches. Training a machine learning model for taxonomic binning can, however, be computationally challenging. Indeed, compositional approaches must be trained on a set of sequences with known taxonomic labels, typically obtained by sampling error-free fragments from reference genomes. In the case of NB classifiers, explicit sampling of fragments from reference genomes is not needed to train the model: instead, a global profile of
<italic>k</italic>
-mer abundance from each reference genome is sufficient to estimate the parameters of the NB model, leading to simple and fast implementations (
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et</italic>
<italic>al</italic>
., 2011</xref>
;
<xref rid="btv683-B28" ref-type="bibr">Rosen
<italic>et</italic>
<italic>al</italic>
., 2011</xref>
;
<xref rid="btv683-B32" ref-type="bibr">Wang
<italic>et</italic>
<italic>al</italic>
., 2007</xref>
). On the other hand, in the case of SVM and related discriminative methods, an explicit sampling of fragments from reference genomes to train the model based on the
<italic>k</italic>
-mer content of each fragment is needed, which can be a limitation for standard SVM implementations. For example,
<xref rid="btv683-B23" ref-type="bibr">Patil
<italic>et</italic>
<italic>al.</italic>
(2012)</xref>
sampled approximately 10 000 fragments from 1768 genomes to train a structured SVM (based on a
<italic>k</italic>
-mer representation with
<italic>k</italic>
 = 4, 5, 6) and reported an accuracy competitive with similarity-based approaches. Increasing the number of fragments sampled to train a SVM may improve its accuracy and allow us to investigate larger values of
<italic>k.</italic>
However, it also raises computational challenges, as it involves machine learning problems where a model must be trained from potentially millions or billions of training examples, each represented by a vector in 10
<sup>7</sup>
dimensions for, e.g.
<italic>k</italic>
 = 12.</p>
<p>In this work, we investigate the potential of compositional approaches for taxonomic label assignment using modern, large-scale machine learning algorithms. We propose a new, rank-flexible compositional approach trained with large-scale machine learning methods and assess its performance in different situations. We show that it provides an interesting trade-off in speed and accuracy compared to the state-of-the-art, particularly when confronted to species absent from the reference database, and for a moderate number of candidate species.</p>
</sec>
<sec>
<title>2 Methods</title>
<sec>
<title>2.1 Linear models for read classification</title>
<p>In most of compositional metagenomics applications, a sequence is represented by its
<italic>k</italic>
-mer profile, namely, a vector counting the number of occurrences of any possible word of
<italic>k</italic>
letters in the sequence. Only the
<inline-formula>
<mml:math id="MM1">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
nucleotides are usually considered to define
<italic>k</italic>
-mer profiles, that are therefore
<inline-formula>
<mml:math id="MM2">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
-dimensional vectors. Although the size of the
<italic>k</italic>
-mer profile of a sequence of length
<italic>l</italic>
increases exponentially with
<italic>k</italic>
, it contains at most
<inline-formula>
<mml:math id="MM3">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
non-zero elements since a sequence of length
<italic>k</italic>
contains
<inline-formula>
<mml:math id="MM4">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
different
<italic>k</italic>
-mers.</p>
<p>Given a sequence represented by its
<italic>k</italic>
-mer profile
<inline-formula>
<mml:math id="MM5">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
, we consider linear models to assign it to one of
<italic>K</italic>
chosen taxonomic classes. A linear model is a set of weight vectors
<inline-formula>
<mml:math id="MM6">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
that assign
<italic>x</italic>
to the class
<disp-formula id="btv683-M1">
<label>(1)</label>
<mml:math id="MM7">
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mrow>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msubsup>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
where
<inline-formula>
<mml:math id="MM8">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
is the standard inner product between vectors. To train the linear model, we start from a training set of sequences
<inline-formula>
<mml:math id="MM9">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
with known taxonomic labels
<inline-formula>
<mml:math id="MM10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
. An NB classifier, e.g. is a linear model where the weights are estimated from the
<italic>k</italic>
-mer count distributions on each class. Another class of linear models popular in machine learning, which include SVM, are the discriminative approaches that learn the weights by solving an optimization problem which aims to separate the training data of each class from each other. More precisely, to optimize the weight
<italic>w
<sub>j</sub>
</italic>
of the
<italic>j</italic>
th class, one typically assigns a binary label
<italic>y
<sub>i</sub>
</italic>
to each training example (
<italic>y
<sub>i</sub>
</italic>
 = 1 if
<italic>c
<sub>i</sub>
</italic>
 = 
<italic>j</italic>
or
<inline-formula>
<mml:math id="MM11">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
otherwise) and solves an optimization problem of the form
<disp-formula id="btv683-M2">
<label>(2)</label>
<mml:math id="MM12">
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mi>w</mml:mi>
</mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:mi></mml:mi>
</mml:mstyle>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
where
<inline-formula>
<mml:math id="MM13">
<mml:mrow>
<mml:mi></mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
is a loss function quantifying how ‘good’ the prediction
<italic>t</italic>
is if the true label is
<italic>y</italic>
, and
<inline-formula>
<mml:math id="MM14">
<mml:mrow>
<mml:mi>λ</mml:mi>
<mml:mo></mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
is a regularization parameter to tune, helpful to prevent overfitting in high dimension. An SVM solves (2) with the hinge loss
<inline-formula>
<mml:math id="MM15">
<mml:mrow>
<mml:mi></mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext>max</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>y</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
, but other losses such as the logistic loss
<inline-formula>
<mml:math id="MM16">
<mml:mrow>
<mml:mi></mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>log</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo></mml:mo>
<mml:mi>y</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
or the squared loss
<inline-formula>
<mml:math id="MM17">
<mml:mrow>
<mml:mi></mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo></mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
are also possible and often lead to models with similar accuracies. These models have met significant success in numerous real-world learning tasks, including compositional metagenomics (
<xref rid="btv683-B23" ref-type="bibr">Patil
<italic>et</italic>
<italic>al.</italic>
, 2012</xref>
). In this work, we use the squared loss function and choose
<italic>λ</italic>
 = 0, a setting that seemed appropriate from preliminary experiments.</p>
</sec>
<sec>
<title>2.2 Large-scale learning of linear models</title>
<p>Although learning linear models by solving (2) is now a mature technology implemented in numerous softwares, metagenomics applications raise computational challenges for most standard implementations, due to the large values that
<italic>n</italic>
(number of reads in the training set),
<inline-formula>
<mml:math id="MM18">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
(dimension of the models) and
<italic>K</italic>
(number of taxonomic classes) can take.</p>
<p>The training set is typically obtained by sampling fragments from reference genomes with known taxonomic class. For example,
<xref rid="btv683-B23" ref-type="bibr">Patil
<italic>et</italic>
<italic>al.</italic>
(2012)</xref>
sampled approximately
<italic>n</italic>
 = 10 000 fragments from 1768 genomes to train SVM models based on
<italic>k</italic>
-mer profiles of size
<italic>k</italic>
 = 4, 5, 6. However, the number of distinct fragments that may be drawn from a genome sequence is approximately equal to its length (by sampling a fragment starting at each position in the genome), hence can reach several millions for each microbial genome, leading to potentially billions of training sequences when thousands of reference genomes are used. While considering every possible fragment from every possible genome may not be the best choice because of the possible redundancy between the reads, it may still be useful to consider a significant number of fragments to properly account for the intra and inter species genomic variability. Similarly, exploring models with
<italic>k</italic>
larger than 6, say 10 or 15, may be interesting but requires (i) the capacity to manipulate the corresponding
<inline-formula>
<mml:math id="MM19">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
-dimensional vectors (
<inline-formula>
<mml:math id="MM20">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>15</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
) and (ii) large training sets since many examples are needed to learn a model in high dimension. Finally, real-life applications involving actual environmental samples may contain several hundreds microbial species, casting the problem into a relatively massive multiclass scenario out of reach of most standard implementations of SVM.</p>
<p>To solve (2) efficiently when
<italic>n</italic>
,
<italic>k</italic>
and
<italic>K</italic>
take large values, we use a dedicated implementation of stochastic gradient descent (SGD
<xref rid="btv683-B5" ref-type="bibr">Bottou, 1998</xref>
) available in the Vowpal Wabbit software (VW,
<xref rid="btv683-B1" ref-type="bibr">Agarwal
<italic>et al</italic>
., 2014</xref>
;
<xref rid="btv683-B12" ref-type="bibr">Langford
<italic>et</italic>
<italic>al</italic>
., 2007</xref>
). In short, SGD exploits the fact that the objective function in (2) is an average of
<italic>n</italic>
terms, one for each training example, to approximate the gradient at each step using a single, randomly chosen term. Although SGD requires more steps to converge to the solution than standard gradient descent, each step is
<italic>n</italic>
times faster and the method is overall faster and more scalable. In addition, although the dimension
<inline-formula>
<mml:math id="MM21">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
of the data is large, VW exploits the fact that each training example is sparse, leading to efficient memory storage and fast updates at each SGD step. We refer the interested reader to
<xref rid="btv683-B6" ref-type="bibr">Bottou (2010)</xref>
for more discussion about the relevance of SGD in large-scale learning. In practice, VW can train a model with virtually no limit on
<italic>n</italic>
as long as the data can be stored on a disk (they are not loaded in memory). As for
<italic>k</italic>
, VW can handle up to 2
<sup>32</sup>
distinct features, and the count of each
<italic>k</italic>
-mer is randomly mapped to one feature by a hash table. This means that we have virtually no limit on
<italic>k</italic>
, except that when
<italic>k</italic>
approaches or exceeds the limit (such that
<inline-formula>
<mml:math id="MM22">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
, i.e.
<italic>k</italic>
 = 16), collisions will appear in the hash table and different
<italic>k</italic>
-mers will be counted together, which may impact the performance of the model.</p>
</sec>
<sec>
<title>2.3 Rank-specific and rank-flexible read classification</title>
<p>The classification approach described in Section 2.1 can be readily applied by labelling sampled fragments according to a given taxonomic rank and learning read classification models tailored to this level of resolution, which is sometimes referred to as
<italic>rank-specific</italic>
approaches (
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et</italic>
<italic>al.</italic>
, 2011</xref>
). We build such rank-specific classifiers at the species, genus and family levels.</p>
<p>In addition, we implement a
<italic>rank-flexible</italic>
classifier to automatically choose the most adequate level where a read should be classified in the taxonomy or leave it unclassified if it looks too different from the reads used to train the model. For that purpose, we assess the reliability of a rank-specific prediction at any level by means of two criteria: the maximum score returned by the linear model (1) and the difference between the two largest scores. According to the terminology proposed by
<xref rid="btv683-B7" ref-type="bibr">Gammerman and Vovk (2007)</xref>
, the former criterion accounts for the
<italic>credibility</italic>
of the prediction: if the sequence is not granted a sufficient score for any class, it may be considered unusual with respect to the training dataset. The latter criterion accounts for the
<italic>confidence</italic>
of the prediction: if the scores of the two top-scoring classes are comparable, both classes may be considered plausible. By combining both criteria we can reject predictions that are unlikely or ambiguous. To combine rank-specific classifiers into a rank-flexible one, we start from the model built at the lowest rank—species in our case—and iteratively allow a rejected read to be classified at the upper rank. If a read is rejected by all rank-specific models considered, it is left unclassified.</p>
<p>The reject option mechanism underlying this rank-flexible procedure heavily depends on thresholds on the maximum score of the linear model and on the difference between the two largest scores, which can be set globally or on a taxon-by-taxon basis. A strategy to optimize these thresholds is described in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Materials</ext-link>
(Section 2), together with an illustration of the trade-offs that can be achieved in terms of recall and precision.</p>
</sec>
</sec>
<sec>
<title>3 Data</title>
<p>We assemble three databases of genomes, which we refer to below as the
<italic>small</italic>
, the
<italic>medium</italic>
and the
<italic>large</italic>
databases. Each comprises a set of
<italic>reference</italic>
genomes to train the models and a set of
<italic>validation</italic>
genomes from which reads are generated to evaluate the performance of the different classification methods.</p>
<p>The
<italic>small</italic>
database contains as references 356 complete genome sequences covering 51 bacterial species, listed in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Table S1</ext-link>
. For validation, we choose 52 genomes not present in the reference database but originating from one of the 51 species (two genomes are indeed available for the
<italic>Francisella tularensis</italic>
species, one of which originating from the
<italic>novicida</italic>
subspecies). This small database is of limited biological interest but is convenient to extensively test the different methods and vary their parameters.</p>
<p>The
<italic>medium</italic>
and
<italic>large</italic>
databases are meant to represent more realistic situations, involving a larger number of candidate bacterial species and a larger number of reference genomes. We download the 5201 complete bacterial and archeal genomes available in the NCBI RefSeq database as of July 2014 (
<xref rid="btv683-B25" ref-type="bibr">Pruitt
<italic>et</italic>
<italic>al.</italic>
, 2012</xref>
), by means of a functionality embedded in the Fragment Classification Package (
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et</italic>
<italic>al.</italic>
, 2011</xref>
). We then filter these sequences according to a criterion proposed in
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et</italic>
<italic>al.</italic>
(2011)</xref>
, only keeping genomes that belong to genera represented by at least three species. We also remove genomes represented by less than 10
<sup>6</sup>
nucleotides to filter draft genome sequences, plasmids, phages, contigs and other short sequences. The 2961 remaining genomes originate from 774 species, among which 193 are represented by at least two strains and 110 by at least three strains. To build the
<italic>medium</italic>
database, we randomly pick one strain within each of the 110 species with at least three strains as a validation set and combine all other sequences in the 193 species with at least two strains as reference database. The rationale of this split is to ensure that each species in the reference database is represented by at least two strains, which allows to optimize the values of the thresholds involved in the classification procedure on a taxon-by-taxon basis, by means of an internal validation process. To define the
<italic>large</italic>
database, we randomly pick one strain within each of the 193 species represented by at least two strains to define the validation database and keep the remaining sequences of the 774 species as reference database. This ensures that each strain in the validation set comes from a species that is present in the reference database.</p>
<p>In addition, we create two additional validation sets to assess the performance of models trained on the
<italic>large</italic>
reference database in more challenging situations. First, we assemble a
<italic>novel lineage</italic>
validation set composed of genomes in the NCBI RefSeq database from species not represented in the large reference database. Details about the number of strains in the
<italic>novel lineage</italic>
validation set are provided in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Table S2</ext-link>
. Second, we consider a
<italic>real</italic>
dataset composed of actual reads, the HMP Microbial Mock Community dataset (Even, Low Concentration, 454 GS FLX Titanium, SRA accession SRX030841). This dataset contains 1 386 198 reads coming from a mock sample made of a genomic DNA mixture obtained from 20 bacterial and 1 archaeal strains [see
<xref rid="btv683-B19" ref-type="bibr">Martin
<italic>et</italic>
<italic>al</italic>
. (2012)</xref>
for further details about the considered strains]. Reads with quality score below 20 are trimmed, and reads shorter than 25 bp are filtered out.</p>
</sec>
<sec>
<title>4 Results</title>
<sec>
<title>4.1 Proof of concept on the
<italic>small</italic>
database</title>
<p>In this section, we present a study on the
<italic>small</italic>
dataset, aiming to evaluate the impact of increasing the number of fragments used to train the model as well as the length of the
<italic>k</italic>
-mers considered. For that purpose, we consider a rank-specific setting defined at the species level, without any reject option mechanism, and learn several classification models based on fragments of length
<italic>L</italic>
 = 200 or
<italic>L</italic>
 = 400 sampled from the 356 reference genomes in the
<italic>small</italic>
reference database, represented by
<italic>k</italic>
-mers of size in
<inline-formula>
<mml:math id="MM23">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>6</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>8</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>12</mml:mn>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
. The number of fragments used to learn the models is gradually increased by drawing several ‘batches’ of fragments to cover, on average, each nucleotide of the reference genomes a pre-defined number of times
<italic>c.</italic>
We vary the coverage
<italic>c</italic>
between 0.1 to its maximal value, equal to the length of the fragments considered. This leads to learning models from around
<inline-formula>
<mml:math id="MM24">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2.7</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
, for
<italic>c</italic>
 = 0.1 and
<italic>L</italic>
 = 400, up to around
<inline-formula>
<mml:math id="MM25">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1.1</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
fragments, when
<italic>c</italic>
reaches its maximal value. This is way beyond the configurations considered for instance in (
<xref rid="btv683-B23" ref-type="bibr">Patil
<italic>et al.</italic>
, 2012</xref>
), where SVM models were learned from approximately 10
<sup>4</sup>
fragments drawn from 1768 genomes.</p>
<p>To assess the performance of these models, we consider two sets of 134 319 fragments, of respective length 200 and 400, drawn from the 52 complete genomes that are not in the reference database used to train the models. Performance is measured in terms of species-level recall. We first compute the prediction recall within each species, i.e. the proportion of fragments originating from this species that are correctly classified and consider the average recall observed across species. In a multiclass setting, this indicator is indeed less biased toward over-represented classes than the global rate of correct classification.</p>
<p>
<xref ref-type="fig" rid="btv683-F1">Figure 1</xref>
shows the performance reached by models based on fragments of length 200 (left) or 400 (right), for different values of
<italic>k</italic>
(horizontal axis) and different coverages (different colors). We first note that for
<italic>c</italic>
 = 0.1, i.e. for a limited number of fragments, the classification performance starts by increasing with the size of the
<italic>k</italic>
-mers (up to
<italic>k</italic>
 = 8 and
<italic>k</italic>
 = 10 for fragments of length 200 and 400, respectively) and subsequently decreases. This suggests that the number of fragments considered in this setting is not sufficient to efficiently learn when the dimensionality of the feature space becomes too large. Note that twice as many fragments of length 200 as fragments of size 400 are drawn for a given coverage value, which may explain why performance still increases beyond
<italic>k</italic>
 = 8 with smaller fragments. Increasing the number of fragments confirms this hypothesis: performance systematically increases or remains steady with
<italic>k</italic>
for
<inline-formula>
<mml:math id="MM26">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
and for
<inline-formula>
<mml:math id="MM27">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, the performance is significantly higher than that obtained at
<italic>c</italic>
 = 0.1, for both length of fragments. Increasing the coverage from
<italic>c</italic>
 = 1 to
<italic>c</italic>
 = 10 has a positive impact in both cases, although marginally for fragments of length 400. Further increasing the number of fragments does not bring any noticeable improvement.
<fig id="btv683-F1" orientation="portrait" position="float">
<label>Fig. 1.</label>
<caption>
<p>
<bold>Increasing the number of fragments and
<italic>k</italic>
-mer size on the
<italic>small</italic>
datasets</bold>
. Left:
<italic>L</italic>
 = 200 bp fragments. Right:
<italic>L</italic>
 = 400 bp fragments. These figures show the average species-level recall obtained by linear predictors trained with Vowpal Wabbit from fragments covering each reference genome with a mean coverage
<italic>c</italic>
from 0.1 to
<italic>L</italic>
. Performances are reported as a function of
<italic>k</italic>
-mer sizes</p>
</caption>
<graphic xlink:href="btv683f1p"></graphic>
</fig>
</p>
<p>Altogether, the optimal configuration on this
<italic>small</italic>
dataset involves
<italic>k</italic>
-mers of size 12 and drawing fragments at a coverage
<inline-formula>
<mml:math id="MM28">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo></mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
for the two lengths of fragments considered. Further increasing the size of the
<italic>k</italic>
-mers did not bring improvements and actually proved to be challenging. Indeed, as mentioned above, VW proceeds by hashing the input features into a vector offering at most 2
<sup>32</sup>
entries. This hashing operation can induce collisions between features, which can be detrimental to the model if the number of features becomes too high with respect to the size of the hash table. This issue is even more stressed in a multiclass setting, where the number of hash table entries available per model is divided by the number of classes considered. On this dataset, 51 models have to be stored in the hash table, which reduces the number of entries available per model to
<inline-formula>
<mml:math id="MM29">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>51</mml:mn>
<mml:mo></mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>32</mml:mn>
<mml:mo></mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>13</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
. We have empirically observed that performance could not increase for
<italic>k</italic>
greater than 12 and actually decreased for
<italic>k</italic>
-mers greater than 15.</p>
<p>We now compare these results to two well-established approaches: a comparative approach based on the BWA-MEM sequence aligner (
<xref rid="btv683-B13" ref-type="bibr">Li, 2013</xref>
) and a compositional approach based on the generative NB classifier (
<xref rid="btv683-B28" ref-type="bibr">Rosen
<italic>et al.</italic>
, 2011</xref>
). The NB experiments rely on the Fragment Classification Package implementation (
<xref rid="btv683-B22" ref-type="bibr">Parks
<italic>et al.</italic>
, 2011</xref>
) and are carried out in the same setting as VW: we compute profiles of
<italic>k</italic>
-mers abundance for the 356 genomes of the reference database and use them to assign test fragments to their most likely genome. BWA-MEM is configured to solely return hits with maximal score (option
<monospace>-T</monospace>
<monospace>0</monospace>
). Unmapped fragments are counted as misclassifications, and a single hit is randomly picked in case of multiple hits, to obtain a species-level prediction. This latter random hit selection process is repeated 20 times, and the performance indicator reported below corresponds to its median value obtained across repetitions. Results are shown in
<xref ref-type="fig" rid="btv683-F2">Figure 2</xref>
. We first note that
<italic>k</italic>
-mer-based approaches, either generative or discriminative, never outperform the alignment-based approach. Comparable results are nevertheless obtained for
<italic>k</italic>
 = 12 with VW, while the NB shows slightly lesser performance (around 4 points in both cases). Performances obtained for shorter
<italic>k</italic>
-mers are markedly lower than that obtained by BWA-MEM. We note finally that VW generally outperforms the NB classifier, except for small
<italic>k</italic>
-mers (
<inline-formula>
<mml:math id="MM30">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
).
<fig id="btv683-F2" orientation="portrait" position="float">
<label>Fig. 2.</label>
<caption>
<p>
<bold>Comparison between Vowpal Wabbit and reference methods on the
<italic>small</italic>
datasets</bold>
. Left:
<italic>L</italic>
 = 200 bp fragments. Right:
<italic>L</italic>
 = 400 bp fragments. These figures show the average species-level recall obtained by linear predictors trained with Vowpal Wabbit from fragments covering each reference genome with a mean coverage equal to 10 (solid line). Performances are reported as afunction of
<italic>k</italic>
-mer sizes. This approach is compared to the standard compositional NB approach (dashed line) and an alignment-based approach based on BWA (dash-dotted line)</p>
</caption>
<graphic xlink:href="btv683f2p"></graphic>
</fig>
</p>
<p>In summary, these experiments demonstrate the relevance and feasibility of large-scale machine learning for taxonomic binning: we obtain a performance comparable to that of the well-established alignment-based approach, provided a sufficient number of fragments and long enough
<italic>k</italic>
-mers are considered to learn the
<italic>k</italic>
-mers-based predictive models.</p>
</sec>
<sec>
<title>4.2 Evaluation on the
<italic>medium</italic>
and large reference databases</title>
<p>We now proceed to a more realistic evaluation involving a larger number of candidate microbial species and a larger number of reference genomes, using the
<italic>medium</italic>
and
<italic>large</italic>
reference databases. We learn classification models based on results obtained on the
<italic>small</italic>
database: we consider
<italic>k</italic>
-mers of size 12 and a number of fragments allowing to cover each base of the reference genomes 10 times in average. We limit our analysis to fragments of length 200, which leads to models learned from around
<inline-formula>
<mml:math id="MM31">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1.38</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="MM32">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2.56</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
fragments for the
<italic>medium</italic>
and
<italic>large</italic>
reference databases, respectively. Note that due to the larger number of species involved, around
<inline-formula>
<mml:math id="MM33">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>193</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
≃ 4
<sup>12</sup>
and
<inline-formula>
<mml:math id="MM34">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mn>774</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
≃ 4
<sup>11</sup>
entries of the VW hash table are available per model for each of these reference databases. We evaluate the performance of the models on fragments extracted from the genomes of the corresponding validation databases and draw a number of fragments necessary to cover each base of each genome once in average, which represents around
<inline-formula>
<mml:math id="MM35">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="MM36">
<mml:mrow>
<mml:mn>3.5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
sequences for the
<italic>medium</italic>
and
<italic>large</italic>
databases, respectively.</p>
<p>We consider the
<italic>rank-specific</italic>
(at the species level) and
<italic>rank-flexible</italic>
VW strategies, as explained in Section 2.3, and compare them to BWA-MEM and NB, also set to work in a species-level rank-specific setting, as well as to the rank-flexible Kraken method trained on the same reference database (
<xref rid="btv683-B33" ref-type="bibr">Wood and Salzberg, 2014</xref>
). We assess the classification performance in terms of precision and recall. For a given species, recall (or sensitivity) is defined as the proportion of test sequences originating from this species that are classified as such. Precision (or positive predictive value) corresponds to the proportion of test sequences actually originating from this species among sequences classified as such. To compare the classification performance of the various systems, we compute the precision and recall observed for each species of the validation dataset and report their average value. We also report the average species-level F-measure, defined as the harmonic mean of precision and recall. For rank-flexible strategies, we consider a second definition of recall, referred to as the upper recall, and the corresponding upper F-measure, in which a sequence is considered to be correctly predicted if it is classified into a ancestor taxon of its species.</p>
<p>Results are shown in
<xref ref-type="fig" rid="btv683-F3">Figure 3</xref>
. We first note that for the
<italic>medium</italic>
reference database, rank-specific VW and BWA-MEM performances are very similar in terms of species-level recall (mean value of 86.2% and 86.8%, respectively). The NB classifier, on the other hand, has a lower species-level recall, with 6 points less than the alignment-based approach. Rank-flexible strategies also exhibit a lower species-level recall but offer a higher level of upper recall, with Kraken providing slightly better performance than VW (1 point in terms of species-level recall and 3 points in upper recall). On the other hand, rank-flexible strategies offer a significantly higher precision than the three rank-specific approaches, which reach comparable values. This is due to the rank-flexible ability to reject predictions and to classify reads at upper ranks, which positively impacts the precision, measured at the species-level. Interestingly, we note that while both rank-flexible approaches reject the same amount of predictions (around 5% per species in average) and classify the same amount of fragments at the species-level (around 84% per species in average), Kraken shows a higher precision than VW (97.5% vs. 95%). These results therefore indicate that trade-offs can be met in terms of precision and recall. Rank-specific approaches based on BWA and VW can indeed offer a higher species-level recall, at the price of classification errors hence of precision. Rank-flexible strategies, on the other hand, manage to maintain a high level of precision, at the price of a lower species-level recall. This drop is, however, counterbalanced by the flexibility of classifying sequences at upper ranks, leading to a higher upper-recall than the species-level recall of rank-specific approaches.
<fig id="btv683-F3" orientation="portrait" position="float">
<label>Fig. 3.</label>
<caption>
<p>
<bold>Performance on the
<italic>medium</italic>
</bold>
<bold>and
<italic>large</italic>
</bold>
<bold>reference databases.</bold>
This figure shows the classification performance measured on genomic fragments in terms of average species-level recall, precision and F-measure, for the various classification strategies considered. For rank-flexible approaches, the average upper recall and upper F-measure are shown as white bars on top of the gray ones, representing species-level indicators</p>
</caption>
<graphic xlink:href="btv683f3p"></graphic>
</fig>
</p>
<p>Considering a larger number of candidate species in the
<italic>large</italic>
reference database proves to be challenging for methods based on short
<italic>k</italic>
-mers. Indeed, a drop of 10 points in terms of species-level recall is observed for the VW and NB rank-specific strategies, while BWA suffers of a lesser drop of around 6 points. Kraken turns out to be the most competitive approach compared to BWA. Indeed, Kraken species-level recall becomes comparable to that of rank-specific VW but still offers a greater precision and the flexibility of retrieving predictions at upper ranks. The rank-flexible strategy based on VW is less competitive on the
<italic>large</italic>
database than it was on the
<italic>medium</italic>
one.</p>
</sec>
<sec>
<title>4.3 Robustness to sequencing errors</title>
<p>The evaluation performed in the previous section is based on taxonomic classification of DNA fragments drawn from reference genomes without errors. In real life, sequencing errors may alter the read sequences and make the classification problem more challenging. To evaluate the robustness of the classifiers to sequencing errors, we generate new reads simulating three realistic types of sequencing errors: homopolymeric stretches, which are commonly encountered in pyrosequencing (e.g. Roche 454) and ion sequencing (e.g. IonTorrent) technologies, general mutations (substitutions and insertions/deletions) and an error model tailored to the Illumina MiSeq technology. For the two former models, we rely on the Grinder read simulation software (
<xref rid="btv683-B2" ref-type="bibr">Angly
<italic>et</italic>
<italic>al.</italic>
, 2012</xref>
). More precisely, we consider the Balzer homopolymeric error model meant to reproduce the Roche 454 technology (
<xref rid="btv683-B3" ref-type="bibr">Balzer
<italic>et</italic>
<italic>al.</italic>
, 2010</xref>
) and the 4th degree polynomial proposed by
<xref rid="btv683-B10" ref-type="bibr">Korbel
<italic>et</italic>
<italic>al.</italic>
(2009)</xref>
to study general mutations. In this latter case, however, we modify the parameters of the error model to reach a median error rate of 2%, in agreement with the results of the original publication (
<xref rid="btv683-B10" ref-type="bibr">Korbel
<italic>et</italic>
<italic>al.</italic>
, 2009</xref>
). The Illumina MiSeq error model was defined internally, according to a procedure described in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Materials</ext-link>
(Section 3). To compare the results of the fragment- and read-based evaluations, each dataset simulated from the
<italic>medium</italic>
and
<italic>large</italic>
validation databases includes as well around
<inline-formula>
<mml:math id="MM37">
<mml:mrow>
<mml:mn>3.5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
or
<inline-formula>
<mml:math id="MM38">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
sequences, respectively.</p>
<p>
<xref ref-type="fig" rid="btv683-F4">Figure 4</xref>
presents the performance obtained by the rank-specific and rank-flexible VW strategies and those obtained by Kraken. On the
<italic>medium</italic>
database, we note that the impact of these sequencing errors is usually limited, where a drop of at most one point is observed with respect to the results obtained on fragments, essentially due do a decrease of the recall. This is, however, not the case for the general mutation error model with the VW rank-flexible strategy, where a drop of 6 points in terms of species-level recall is observed. The upper recall, on the other hand, remains comparable, indicating that a larger number of predictions are made (correctly) at an upper rank in this case. Considering a larger number of species in the
<italic>large</italic>
database also has a limited impact for the VW rank-specific strategy and for Kraken, for the Balzer and Miseq error models (drop of <0.5 point with respect to the fragments results). The mutation error model is, however, more challenging, leading to a drop of species-level recall of 3 points for VW and 1.5 point for Kraken (reduced to a 0.7 point in terms of upper recall). The rank-flexible VW strategy, on the other hand, is more impacted. We note indeed a higher drop in terms of species-level recall, even for the Balzer and MiSeq error model (2 and 5 points, respectively). The situation is even worse for the mutation error model, where a drop of almost 12 points is observed. This is due, at least in part, to an increase of the rejection rate with this error model (13% per species in average vs. 8–10% for the same strategy on other error models and 7.5% with Kraken on the same error model). In any case, the drop in terms of upper recall is less severe, indicating that a larger fraction of predictions are made above the species level.
<fig id="btv683-F4" orientation="portrait" position="float">
<label>Fig. 4.</label>
<caption>
<p>
<bold>Robustness to sequencing errors.</bold>
Top:
<italic>medium</italic>
reference database; Bottom:
<italic>large</italic>
reference database. This figure shows the classification performance measured on simulated reads in terms of average species-level recall, precision and F-measure, for the various classification strategies considered. For rank-flexible approaches, the average upper recall and upper F-measure are shown as white bars on top of the gray ones, representing species level</p>
</caption>
<graphic xlink:href="btv683f4p"></graphic>
</fig>
</p>
<p>In summary, these results suggest that
<italic>k</italic>
-mer-based read classification models are robust to realistic sequencing noise. We note, however, that VW is globally more impacted than Kraken, especially in its rank-flexible setting and the general mutation error model. Additional experiments involving higher levels of noise, described in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Materials</ext-link>
(Section 4), confirm this observation.</p>
</sec>
<sec>
<title>4.4 Evaluation on novel lineages</title>
<p>Metagenomic samples encountered in real-life applications may include microorganisms not represented in the reference database used to carry out the taxonomic binning. This may be due to the presence of a previously unknown microorganism, which is common in environmental studies, or of a microorganism for which no reference genome is available. In this section, we thus evaluate the ability of taxonomic binning methods to classify reads coming from novel bacterial lineages at their appropriate taxonomic rank. For this purpose, we extract from the NCBI RefSeq database genomes of species not represented in the
<italic>large</italic>
reference database and qualify them according to the rank of their closest relative. For instance, a strain is said to be ‘reachable’ at the genus level when its species is not part of the reference database but when other species of the same genus are represented. To estimate the performance of classifiers for such novel lineages, we extract strains reachable at the genus, family, order, class and phylum levels, draw genomic fragments from their genomes and classify them using the rank-flexible VW strategy and Kraken, on the
<italic>large</italic>
reference database. We assess the performance in terms of the proportion of reads that are (i) assigned to the appropriate taxon, (ii) assigned to an ancestor taxon, (iii) assigned to a descendant taxon, (iv) rejected and (v) misclassified. In the two former cases, the prediction is considered to be correct, which allows to define criteria of recall by considering case (i) and upper recall by considering cases (i) and (ii). Predictions assigned to a descendant taxon are said to be too specific.</p>
<p>
<xref ref-type="fig" rid="btv683-F5">Figure 5</xref>
shows the results obtained for novel lineages reachable at the genus, family and order levels. Results obtained at other ranks are provided in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Materials</ext-link>
(Section 5). We can first note from panels A and B that both VW and Kraken reject a large proportion of the predictions (between 56% and more than 93% in average across reachable taxa). The reject rate increases with the taxonomic distance between the novel lineage and the reference database in both cases and is larger with Kraken, especially at the family level and above. The high reject rate of Kraken comes with a very moderate error rate (around 3% in average), while VW suffers from a much higher error rate (around 15%). We note moreover that the recall is very low at the family level and above, suggesting that it is difficult to effectively detect genomic proximity from
<italic>k</italic>
-mers at such taxonomic distance. A closer look at novel lineages reachable at the genus level, i.e. strains coming from species for which species of the same genus are available, which probably constitutes the most realistic scenario, highlights important differences between VW and Kraken. We note indeed that the recall is much higher with VW than Kraken: 15.3% vs. 4.2% on average across reachable genera and 19.4% vs. 4.6% in terms of upper recall. Conversely, Kraken classifies a greater proportion of fragments too specifically, assigning 26.3% of the fragments to a sibling species, instead of 10.3% with VW. This is illustrated in
<xref ref-type="fig" rid="btv683-F5">Figure 5</xref>
C, which compares the upper recall on a genus-by-genus basis, as well as in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Figure S9</ext-link>
, which further highlights the trade-off observed in terms of recall and proportion of too specific prediction.
<fig id="btv683-F5" orientation="portrait" position="float">
<label>Fig. 5.</label>
<caption>
<p>
<bold>VW and Kraken performance on novel lineages.</bold>
Panels (
<bold>A</bold>
) and (
<bold>B</bold>
) present the proportions of unclassified fragments, the recall (fragments assigned to their correct reachable taxon), the upper recall (fragments assigned to their correct reachable taxon or one of its ancestors), the proportion of too specific predictions (fragments assigned to a descendant of their reachable taxon) and the error rate (fragments assigned to a taxon not part of the branch of their reachable taxon). These results are shown for three reachable ranks (genus, family and order) and correspond to average values across reachable taxa of a given rank. Results obtained at higher ranks are provided in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Figure S7</ext-link>
. Panel (
<bold>C</bold>
) compares the VW and Kraken upper recall estimations for the 69 genera used to evaluate the performance on the new strains reachable at the genus level</p>
</caption>
<graphic xlink:href="btv683f5p"></graphic>
</fig>
</p>
<p>In summary, these results indicate first that the faculty of
<italic>k</italic>
-mers-based methods to effectively recognize novel genera and other more taxonomically distant lineages is limited. As for novel species, the rank-flexible VW strategy is more efficient than Kraken, with a greater ability to assign reads to the appropriate genus (or above) where Kraken essentially classify reads at the species level, that is, as a sibling species.</p>
</sec>
<sec>
<title>4.5 Evaluation on a real dataset</title>
<p>We finally compare the predictions obtained by VW and Kraken on the
<italic>real</italic>
, HMP Microbial Mock Community dataset, using the models learned on the
<italic>large</italic>
reference database. As we have no certainty about the correct classification of each individual read, we compare the aggregated predictions over all reads.</p>
<p>
<xref ref-type="fig" rid="btv683-F6">Figure 6</xref>
shows the abundance profiles obtained by the two approaches, restricted to taxa accounting for at least 0.2% of the sample. We note that these abundance profiles are highly comparable. In particular, we note that both approaches reject approximately the same amount of reads, while VW classifies slightly more reads at the genus-level. Eighteen out of the 21 spiked strains are retrieved by both methods (
<italic>Streptococcus agalactiae</italic>
and
<italic>Lactobacillus gasseri</italic>
are not shown in
<xref ref-type="fig" rid="btv683-F6">Figure 6</xref>
because their abundance is below 0.2%) and the three remaining ones (
<italic>Escherichia coli</italic>
,
<italic>Rhodobacter sphaeroides</italic>
and
<italic>Actinomyces odontolyticus</italic>
) are not, because they are not represented in the reference database. Interestingly, both methods classify reads coming from the ‘novel’
<italic>E. coli</italic>
strain into
<italic>Shigella</italic>
species because of their close relatedness (
<xref rid="btv683-B16" ref-type="bibr">Lukjancenko
<italic>et</italic>
<italic>al.</italic>
, 2010</xref>
), with the advantage for VW to take into account this uncertainty by classifying more reads at the
<italic>Shigella</italic>
genus level.
<fig id="btv683-F6" orientation="portrait" position="float">
<label>Fig. 6.</label>
<caption>
<p>
<bold>Abundance profiles obtained with VW (
<bold>A</bold>
) and Kraken (
<bold>B</bold>
) on the HMP spiked dataset</bold>
. Abundance profiles are presented at the genus level, with colors corresponding to the different predicted species, restricted to species accounting for at least 0.2%) of the samples. Light-gray fractions of the bars correspond to genus-level predictions</p>
</caption>
<graphic xlink:href="btv683f6p"></graphic>
</fig>
</p>
</sec>
<sec>
<title>4.6 Classification speed</title>
<p>Last but not least, we now turn to the comparison of the comparative and compositional approaches in terms of prediction time. This aspect is indeed of critical importance for the analysis of the large volumes of sequence data provided by next-generation sequencing technologies and constitutes the main motivation of resorting to
<italic>k</italic>
-mer-based approaches. To perform this evaluation, we measure the time taken by BWA-MEM and the
<italic>k</italic>
-mer-based approaches to process the four test datasets involved in the previous experiments (one fragment dataset, three reads datasets), with the models learned from the
<italic>medium</italic>
and
<italic>large</italic>
databases, to investigate the impact of the number of species involved in the reference database. We do not make a distinction between the VW and NB compositional approaches: both involve computing a score for each candidate species, defined as a dot product between the
<italic>k</italic>
-mer profile of the sequence to classify and a vector of weights obtained by training the model. To compute this dot-product efficiently, we implemented a procedure described in
<xref rid="btv683-B30" ref-type="bibr">Sonnenburg
<italic>et</italic>
<italic>al.</italic>
(2006)</xref>
. With this procedure, each A, T, G, C nucleotide is encoded by two bits, which allows to directly convert a
<italic>k</italic>
-mer as in integer between 0 and
<inline-formula>
<mml:math id="MM39">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
. Provided that the weight vector is loaded into memory, the score can be computed ‘on the fly’ while evaluating the
<italic>k</italic>
-mer profile of the sequence to be classified, by adding the contribution of the current
<italic>k</italic>
-mer to the score. The drawback of this procedure lies in the fact that the vectors of weights defining the classification models need to be loaded into memory, which can be cumbersome in a multiclass setting. For 193 and 774 species and
<italic>k</italic>
-mers of size 12, this amounted to 12 and 48 gigabytes, respectively.</p>
<p>Computation times are measured on a single CPU (Intel XEON-2.8 Ghz) equipped with 250 GB of memory and are detailed in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/lookup/suppl/doi:10.1093/bioinformatics/btv683/-/DC1">Supplementary Table S4</ext-link>
. The time needed to classify each read or fragment dataset by VW shows little variation, for a given reference database. Its median value obtained across test datasets reaches 4.4 and 8.8 min using the
<italic>medium</italic>
and
<italic>large</italic>
reference databases, respectively, hence about a 2-fold difference. The time taken by Kraken is smaller and more stable across datasets: 2.9 and 3.3 min using the
<italic>medium</italic>
and
<italic>large</italic>
reference databases, respectively, hence about 1.5 and 2.7 times faster than VW. On the
<italic>large</italic>
database, this therefore amounts to classifying around
<inline-formula>
<mml:math id="MM40">
<mml:mrow>
<mml:mn>4</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="MM41">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
 200 bp reads per minute with VW and Kraken, respectively. BWA-MEM shows a different behavior. We indeed observe that the time is longer with reads than fragments, while the size of the reference database has a lesser impact. Compositional approaches systematically offer shorter prediction times than the alignment-based approach, with an improvement of 2.4–12.4 for VW and 7–17.3 times with Kraken, depending on the configuration.</p>
<p>Finally,
<xref ref-type="fig" rid="btv683-F7">Figure 7</xref>
further investigates the relationship between the number of species involved in the reference database and VW prediction time. For this purpose, we generate random models involving 1 to 1000 classes and measure the time needed to process each test dataset. We note that for problems involving fewer than 100 candidate species, VW can achieve faster prediction times than Kraken. This situation may in particular arise in diagnostic applications involving specific types of specimens or, at the extreme, regarding the issue of filtering reads coming from the human host in microbiome samples. In this latter case, a binary VW model trained to discriminate bacterial from human reads takes around 20 s to process these test datasets, while the dedicated Kraken model takes around 3 min.
<fig id="btv683-F7" orientation="portrait" position="float">
<label>Fig. 7.</label>
<caption>
<p>
<bold>Impact of the number of classes on VW prediction time.</bold>
Time needed to process each test dataset by VW as the number of classes increases from 1 to 1000. The dashed horizontal line represents the median time measured by Kraken on the
<italic>large</italic>
database</p>
</caption>
<graphic xlink:href="btv683f7p"></graphic>
</fig>
</p>
</sec>
</sec>
<sec>
<title>5 Discussion</title>
<p>In this work, we investigate the potential of modern, large-scale machine learning approaches for taxonomic binning of metagenomics data and propose a new rank-flexible classification strategy. We extensively evaluate its performance when the scale of the problem increases regarding (i) the length of the
<italic>k</italic>
-mers considered to represent a sequence, (ii) the number of fragments used to learn the model and (iii) the number of candidate species involved in the reference database. We also investigate in details its robustness to sequencing errors using simulated reads. We consider three baselines for this evaluation: a comparative approach based on the BWA-MEM sequence aligner and two compositional approaches based on the generative NB classifier and based on Kraken. We demonstrate in particular that increasing the number of fragments used to train the model has a significant impact on the accuracy of the model and allows to estimate models based on longer
<italic>k</italic>
-mers. While this could be expected and was already highlighted by previous studies, the resulting configurations are out of reach of standard SVM implementations. We also show that discriminatively trained compositional models usually offer significantly higher performances than generative NB classifiers. The resulting models are competitive with well-established alignment tools and with Kraken for problems involving a small to moderate number of candidate species and for realistic amounts of sequencing errors. Our results suggest, however, that machine learning-based compositional approaches, both discriminative and generative, are still limited in their ability to deal with problems involving more than a few hundreds species. In this case, indeed, compositional approaches exhibit lower performance than alignment-based approaches and are much more negatively impacted by sequencing errors. When confronted with species absent from the training set, we show that our model is more accurate than Kraken, which has a larger level of rejection due to its use of longer
<italic>k</italic>
-mers, and affects more reads too specifically to species of the reference dataset. Finally, we confirm that compositional approaches achieve faster prediction times. This is indeed systematically the case in the various configurations listed above, with predictions obtained 2–17 times faster by compositional approaches, and, interestingly, depends on the number of candidate species. We note in particular, that for problems involving fewer than 100 candidate species, which may arise in diagnostic applications involving specific types of specimens, VW can achieve faster prediction times than Kraken. At the extreme, for the binary problem aiming to separate bacterial from human reads, which is commonly used while analyzing a microbiome sample, VW can offer a 9-fold increase in terms of prediction time with respect to Kraken. We emphasize, however, that fast predictions can only be obtained provided that the classification models are loaded in memory, hence for a memory footprint that scales linearly with the number of candidate species and exponentially with the size of the
<italic>k</italic>
-mers, which can become important for large reference databases and long
<italic>k</italic>
-mers.</p>
<p>At least three simple extensions could be envisioned to make compositional approaches more competitive in accuracy with the alignment-based approach, faster and to limit their memory footprint. First, the robustness to sequencing errors may be improved by learning models from simulated reads instead of fragments. This could indeed allow to tune the model to the sequencing technology producing the reads to be analyzed, provided its error model is properly known and characterized. Second, introducing a sparsity-inducing penalty while learning the model would have the effect of reducing the number of features entering the model, hence to reduce the memory footprint required to load the model into memory. Finally, alternative strategies, known as error correcting tournaments (
<xref rid="btv683-B4" ref-type="bibr">Beygelzimer
<italic>et</italic>
<italic>al.</italic>
, 2009</xref>
), could be straightforwardly considered to reduce the number of models to learn, hence to store into memory during prediction, to address a multiclass problem. Our results indeed suggest that addressing these issues is critical to build state-of-the-art compositional classifiers to analyze metagenomics samples that may involve a broad spectrum of species.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_32_7_1023__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_btv683_supplementary.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>Funding</title>
<p>This work was supported by the
<funding-source>European Research Council</funding-source>
(
<award-id>SMAC-ERC-280032</award-id>
to J-P.V.) and the
<funding-source>French National Research Agency</funding-source>
(
<award-id>ANR-11-BINF-0001</award-id>
to J-P.V.).</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="btv683-B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agarwal</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2014</year>
)
<article-title>A reliable effective terascale linear learning system</article-title>
.
<source>J. Mach. Learn. Res.</source>
,
<volume>15</volume>
,
<fpage>1111</fpage>
<lpage>1133</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Angly</surname>
<given-names>F.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2012</year>
)
<article-title>Grinder: a versatile amplicon and shotgun sequence simulator</article-title>
.
<source>Nucleic Acids Res.</source>
,
<volume>40</volume>
,
<fpage>94</fpage>
<lpage>94</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Balzer</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2010</year>
)
<article-title>Characteristics of 454 pyrosequencing data enabling realistic simulation with flowsim</article-title>
.
<source>Bioinformatics</source>
,
<volume>26</volume>
,
<fpage>420</fpage>
<lpage>425</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beygelzimer</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2009</year>
)
<article-title>Error-correcting tournaments</article-title>
.
<source>Algorithmic Learn. Theory</source>
,
<volume>5809</volume>
,
<fpage>247</fpage>
<lpage>262</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bottou</surname>
<given-names>L.</given-names>
</name>
</person-group>
(
<year>1998</year>
)
<article-title>Online learning and stochastic approximations</article-title>
.
<source>Online Learn. Neural Netw.</source>
,
<volume>17</volume>
,
<fpage>9</fpage>
<lpage>42</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B6">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bottou</surname>
<given-names>L.</given-names>
</name>
</person-group>
(
<year>2010</year>
)
<article-title>Large-scale machine learning with stochastic gradient descent</article-title>
.
<comment>In: Lechevallier,Y. and Saporta,G. (eds),
<italic>Proceedings of COMPSTAT’2010</italic>
, Physica-Verlag, pp. 177–186</comment>
.</mixed-citation>
</ref>
<ref id="btv683-B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gammerman</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Vovk</surname>
<given-names>V.</given-names>
</name>
</person-group>
(
<year>2007</year>
)
<article-title>Eedging predictions in machine learning</article-title>
.
<source>Comput. J.</source>
,
<volume>50</volume>
,
<fpage>151</fpage>
<lpage>163</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hugenholtz</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2002</year>
)
<article-title>Exploring prokaryotic diversity in the genomic era</article-title>
.
<source>Genome Biol.</source>
,
<volume>3</volume>
,
<fpage>1</fpage>
<lpage>3</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2007</year>
)
<article-title>MEGAN analysis of metagenomic data</article-title>
.
<source>Genome Res.</source>
,
<volume>17</volume>
,
<fpage>377</fpage>
<lpage>386</lpage>
.
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Korbel</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2009</year>
)
<article-title>PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data</article-title>
.
<source>Genome Biol.</source>
,
<volume>10</volume>
,
<fpage>23</fpage>
<lpage>23</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koslicki</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2014</year>
)
<article-title>WGSQuikr: fast whole-genome shotgun metagenomic classification</article-title>
.
<source>PLoS One</source>
,
<volume>9</volume>
,
<fpage>e91784</fpage>
.
<pub-id pub-id-type="pmid">24626336</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B12">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Langford</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2007</year>
)
<article-title>Vowpal Wabbit open source project</article-title>
.
<source>Technical report</source>
.
<publisher-name>Yahoo</publisher-name>
.</mixed-citation>
</ref>
<ref id="btv683-B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group>
(
<year>2013</year>
)
<article-title>Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM</article-title>
.
<source>arXiv preprint arXiv:1303.3997</source>
.</mixed-citation>
</ref>
<ref id="btv683-B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R.</given-names>
</name>
</person-group>
(
<year>2009</year>
)
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
.
<source>Bioinformatics</source>
,
<volume>25</volume>
,
<fpage>1754</fpage>
<lpage>1760</lpage>
.
<pub-id pub-id-type="pmid">19451168</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lindner</surname>
<given-names>M.S.</given-names>
</name>
<name>
<surname>Renard</surname>
<given-names>B.Y.</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Metagenomic abundance estimation and diagnostic testing on species level</article-title>
.
<source>Nucleic Acids Res.</source>
,
<volume>41</volume>
,
<fpage>e10</fpage>
.
<pub-id pub-id-type="pmid">22941661</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lukjancenko</surname>
<given-names>O.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2010</year>
)
<article-title>Comparison of 61 sequenced
<italic>Escherichia coli</italic>
genomes</article-title>
.
<source>Microb. Ecol.</source>
,
<volume>60</volume>
,
<fpage>708</fpage>
<lpage>720</lpage>
.
<pub-id pub-id-type="pmid">20623278</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mande</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2012</year>
)
<article-title>Classification of metagenomic sequences: methods and challenges</article-title>
.
<source>Brief Bioinform.</source>
,
<volume>13</volume>
,
<fpage>669</fpage>
<lpage>681</lpage>
.
<pub-id pub-id-type="pmid">22962338</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B19">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Martin</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2012</year>
)
<article-title>Optimizing read mapping to reference genomes to determine composition and species prevalence in microbial communities</article-title>
.
<source>PLoS One</source>
,
<volume>7</volume>
,
<fpage>e36427</fpage>
.
<pub-id pub-id-type="pmid">22719831</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B20">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McHardy</surname>
<given-names>A.C.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2007</year>
)
<article-title>Accurate phylogenetic classification of variable-length DNA fragments</article-title>
.
<source>Nat. Methods</source>
,
<volume>4</volume>
,
<fpage>63</fpage>
<lpage>72</lpage>
.
<pub-id pub-id-type="pmid">17179938</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B21">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miller</surname>
<given-names>R.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2013</year>
)
<article-title>Metagenomics for pathogen detection in public health</article-title>
.
<source>Genome Med.</source>
,
<volume>5</volume>
,
<fpage>81</fpage>
<lpage>95</lpage>
.
<pub-id pub-id-type="pmid">24050114</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B22">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parks</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2011</year>
)
<article-title>Classifying short genomic fragments from novel lineages using composition and homology</article-title>
.
<source>BMC Bioinformatics</source>
,
<volume>12</volume>
,
<fpage>328</fpage>
<lpage>344</lpage>
.
<pub-id pub-id-type="pmid">21827705</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B23">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Patil</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2012</year>
)
<article-title>The PhyloPythiaS web server for taxonomic assignment of metagenome sequences</article-title>
.
<source>PLoS One</source>
,
<volume>7</volume>
,
<fpage>e38581</fpage>
.
<pub-id pub-id-type="pmid">22745671</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B24">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peterson</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2009</year>
)
<article-title>The NIH human microbiome project</article-title>
.
<source>Genome Res.</source>
,
<volume>19</volume>
,
<fpage>2317</fpage>
<lpage>2323</lpage>
.
<pub-id pub-id-type="pmid">19819907</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B25">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruitt</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2012</year>
)
<article-title>NCBI reference sequences (RefSeq): current status, new features and genome annotation policy</article-title>
.
<source>Nucleic Acids Res.</source>
,
<volume>40</volume>
,
<fpage>130</fpage>
<lpage>135</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B27">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Riesenfeld</surname>
<given-names>C.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2004</year>
)
<article-title>Metagenomics: genomic analysis of microbial communities</article-title>
.
<source>Annu. Rev. Genet.</source>
,
<volume>38</volume>
,
<fpage>525</fpage>
<lpage>552</lpage>
.
<pub-id pub-id-type="pmid">15568985</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B28">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosen</surname>
<given-names>G.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2011</year>
)
<article-title>NBC: the Naive Bayes classification tool webserver for taxonomic classification of metagenomic reads</article-title>
.
<source>Bioinformatics</source>
,
<volume>27</volume>
,
<fpage>127</fpage>
<lpage>129</lpage>
.
<pub-id pub-id-type="pmid">21062764</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B29">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmieder</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>R.</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Insights into antibiotic resistance through metagenomic approaches</article-title>
.
<source>Future Microbiol.</source>
,
<volume>7</volume>
,
<fpage>73</fpage>
<lpage>89</lpage>
.
<pub-id pub-id-type="pmid">22191448</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B30">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sonnenburg</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2006</year>
)
<article-title>Large scale learning with string kernels</article-title>
.
<source>J. Mach. Learn. Res.</source>
,
<volume>7</volume>
,
<fpage>1531</fpage>
<lpage>1565</lpage>
.</mixed-citation>
</ref>
<ref id="btv683-B31">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soon</surname>
<given-names>W.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2013</year>
)
<article-title>High-throughput sequencing for biology and medicine</article-title>
.
<source>Mol. Syst. Biol.</source>
,
<volume>9</volume>
.</mixed-citation>
</ref>
<ref id="btv683-B32">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
<etal></etal>
</person-group>
(
<year>2007</year>
)
<article-title>Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy</article-title>
.
<source>Appl. Environ. Microbiol.</source>
,
<volume>73</volume>
,
<fpage>5261</fpage>
<lpage>5267</lpage>
.
<pub-id pub-id-type="pmid">17586664</pub-id>
</mixed-citation>
</ref>
<ref id="btv683-B33">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>D.E.</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>S.L.</given-names>
</name>
</person-group>
(
<year>2014</year>
)
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
.
<source>Genome Biol.</source>
,
<volume>15</volume>
,
<fpage>R46</fpage>
.
<pub-id pub-id-type="pmid">24580807</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000085 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000085 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4896366
   |texte=   Large-scale machine learning for metagenomics sequence classification
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26589281" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021