<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage</title>
<author>
<name sortKey="Qian, Jia" sort="Qian, Jia" uniqKey="Qian J" first="Jia" last="Qian">Jia Qian</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31757198</idno>
<idno type="pmc">6873667</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6873667</idno>
<idno type="RBID">PMC:6873667</idno>
<idno type="doi">10.1186/s12859-019-2904-4</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000286</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000286</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage</title>
<author>
<name sortKey="Qian, Jia" sort="Qian, Jia" uniqKey="Qian J" first="Jia" last="Qian">Jia Qian</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Motivation</title>
<p>Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Because assembly typically produces only genome fragments, also known as contigs, it is crucial to group them into putative species for further taxonomic profiling and downstream functional analysis. Taxonomic analysis of microbial communities requires contig clustering, a process referred to as binning, which is still one of the most challenging tasks when analyzing metagenomic data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, sequencing errors, and the limitations due to binning contigs of different lengths.</p>
</sec>
<sec>
<title>Results</title>
<p>In this context we present MetaCon, a novel tool for unsupervised metagenomic contig binning based on probabilistic k-mers statistics and coverage. MetaCon uses a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species; in addition, contigs of different lengths are clustered in two separate phases. The effectiveness of MetaCon is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, MaxBin and MetaBAT.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2904-4) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Staley, Jt" uniqKey="Staley J">JT Staley</name>
</author>
<author>
<name sortKey="Konopka, A" uniqKey="Konopka A">A Konopka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Handelsman, J" uniqKey="Handelsman J">J Handelsman</name>
</author>
<author>
<name sortKey="Rondon, Mr" uniqKey="Rondon M">MR Rondon</name>
</author>
<author>
<name sortKey="Brady, Sf" uniqKey="Brady S">SF Brady</name>
</author>
<author>
<name sortKey="Clardy, J" uniqKey="Clardy J">J Clardy</name>
</author>
<author>
<name sortKey="Goodman, Rm" uniqKey="Goodman R">RM Goodman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Felczykowska, A" uniqKey="Felczykowska A">A Felczykowska</name>
</author>
<author>
<name sortKey="Bloch, Sk" uniqKey="Bloch S">SK Bloch</name>
</author>
<author>
<name sortKey="Nejman Fale Czyk, B" uniqKey="Nejman Fale Czyk B">B Nejman-Faleńczyk</name>
</author>
<author>
<name sortKey="Bara Ska, S" uniqKey="Bara Ska S">S Barańska</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mande, Ss" uniqKey="Mande S">SS Mande</name>
</author>
<author>
<name sortKey="Mohammed, Mh" uniqKey="Mohammed M">MH Mohammed</name>
</author>
<author>
<name sortKey="Ghosh, Ts" uniqKey="Ghosh T">TS Ghosh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alneberg, J" uniqKey="Alneberg J">J Alneberg</name>
</author>
<author>
<name sortKey="Brynjar Smari, B" uniqKey="Brynjar Smari B">B Brynjar Smári</name>
</author>
<author>
<name sortKey="Ino, Db" uniqKey="Ino D">DB Ino</name>
</author>
<author>
<name sortKey="Melanie, S" uniqKey="Melanie S">S Melanie</name>
</author>
<author>
<name sortKey="Joshua, Q" uniqKey="Joshua Q">Q Joshua</name>
</author>
<author>
<name sortKey="Umer Z, I" uniqKey="Umer Z I">I Umer Z</name>
</author>
<author>
<name sortKey="Leo, L" uniqKey="Leo L">L Leo</name>
</author>
<author>
<name sortKey="Nicholas J, L" uniqKey="Nicholas J L">L Nicholas J</name>
</author>
<author>
<name sortKey="Anders F, A" uniqKey="Anders F A">A Anders F</name>
</author>
<author>
<name sortKey="Christopher, Q" uniqKey="Christopher Q">Q Christopher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bowers, Rm" uniqKey="Bowers R">RM Bowers</name>
</author>
<author>
<name sortKey="Clum, A" uniqKey="Clum A">A Clum</name>
</author>
<author>
<name sortKey="Tice, H" uniqKey="Tice H">H Tice</name>
</author>
<author>
<name sortKey="Lim, J" uniqKey="Lim J">J Lim</name>
</author>
<author>
<name sortKey="Singh, K" uniqKey="Singh K">K Singh</name>
</author>
<author>
<name sortKey="Ciobanu, D" uniqKey="Ciobanu D">D Ciobanu</name>
</author>
<author>
<name sortKey="Ngan, Cy" uniqKey="Ngan C">CY Ngan</name>
</author>
<author>
<name sortKey="Cheng, J F" uniqKey="Cheng J">J-F Cheng</name>
</author>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Woyke, T" uniqKey="Woyke T">T Woyke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author>
<name sortKey="Hofmann, P" uniqKey="Hofmann P">P Hofmann</name>
</author>
<author>
<name sortKey="Mchardy, Ac" uniqKey="Mchardy A">AC McHardy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, D H" uniqKey="Huson D">D. H. Huson</name>
</author>
<author>
<name sortKey="Auch, A F" uniqKey="Auch A">A. F. Auch</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J. Qi</name>
</author>
<author>
<name sortKey="Schuster, S C" uniqKey="Schuster S">S. C. Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, Derrick E" uniqKey="Wood D">Derrick E Wood</name>
</author>
<author>
<name sortKey="Salzberg, Steven L" uniqKey="Salzberg S">Steven L Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ounit, R" uniqKey="Ounit R">R Ounit</name>
</author>
<author>
<name sortKey="Wanamaker, S" uniqKey="Wanamaker S">S Wanamaker</name>
</author>
<author>
<name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author>
<name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qian, J" uniqKey="Qian J">J Qian</name>
</author>
<author>
<name sortKey="Marchiori, D" uniqKey="Marchiori D">D Marchiori</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segata, Nicola" uniqKey="Segata N">Nicola Segata</name>
</author>
<author>
<name sortKey="Waldron, Levi" uniqKey="Waldron L">Levi Waldron</name>
</author>
<author>
<name sortKey="Ballarini, Annalisa" uniqKey="Ballarini A">Annalisa Ballarini</name>
</author>
<author>
<name sortKey="Narasimhan, Vagheesh" uniqKey="Narasimhan V">Vagheesh Narasimhan</name>
</author>
<author>
<name sortKey="Jousson, Olivier" uniqKey="Jousson O">Olivier Jousson</name>
</author>
<author>
<name sortKey="Huttenhower, Curtis" uniqKey="Huttenhower C">Curtis Huttenhower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lindgreen, S" uniqKey="Lindgreen S">S Lindgreen</name>
</author>
<author>
<name sortKey="Adair, Kl" uniqKey="Adair K">KL Adair</name>
</author>
<author>
<name sortKey="Gardner, P" uniqKey="Gardner P">P Gardner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Girotto, S" uniqKey="Girotto S">S Girotto</name>
</author>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author>
<name sortKey="Yiu, Sm" uniqKey="Yiu S">SM Yiu</name>
</author>
<author>
<name sortKey="Yang, B" uniqKey="Yang B">B Yang</name>
</author>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Liu, Z" uniqKey="Liu Z">Z Liu</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Qin, J" uniqKey="Qin J">J Qin</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Y W" uniqKey="Wu Y">Y-W Wu</name>
</author>
<author>
<name sortKey="Simmons, Ba" uniqKey="Simmons B">BA Simmons</name>
</author>
<author>
<name sortKey="Singer, Sw" uniqKey="Singer S">SW Singer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Imelfort, Michael" uniqKey="Imelfort M">Michael Imelfort</name>
</author>
<author>
<name sortKey="Parks, Donovan" uniqKey="Parks D">Donovan Parks</name>
</author>
<author>
<name sortKey="Woodcroft, Ben J" uniqKey="Woodcroft B">Ben J. Woodcroft</name>
</author>
<author>
<name sortKey="Dennis, Paul" uniqKey="Dennis P">Paul Dennis</name>
</author>
<author>
<name sortKey="Hugenholtz, Philip" uniqKey="Hugenholtz P">Philip Hugenholtz</name>
</author>
<author>
<name sortKey="Tyson, Gene W" uniqKey="Tyson G">Gene W. Tyson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kang, Dd" uniqKey="Kang D">DD Kang</name>
</author>
<author>
<name sortKey="Froula, J" uniqKey="Froula J">J Froula</name>
</author>
<author>
<name sortKey="Egan, R" uniqKey="Egan R">R Egan</name>
</author>
<author>
<name sortKey="Wang, Z" uniqKey="Wang Z">Z Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kantorovitz, Miriam R" uniqKey="Kantorovitz M">Miriam R. Kantorovitz</name>
</author>
<author>
<name sortKey="Robinson, Gene E" uniqKey="Robinson G">Gene E. Robinson</name>
</author>
<author>
<name sortKey="Sinha, Saurabh" uniqKey="Sinha S">Saurabh Sinha</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sims, Gregory E" uniqKey="Sims G">Gregory E. Sims</name>
</author>
<author>
<name sortKey="Jun, Se Ran" uniqKey="Jun S">Se-Ran Jun</name>
</author>
<author>
<name sortKey="Wu, Guohong A" uniqKey="Wu G">Guohong A. Wu</name>
</author>
<author>
<name sortKey="Kim, Sung Hou" uniqKey="Kim S">Sung-Hou Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Antonello, M" uniqKey="Antonello M">M Antonello</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Leoni, A" uniqKey="Leoni A">A Leoni</name>
</author>
<author>
<name sortKey="Schimd, M" uniqKey="Schimd M">M Schimd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Antonello, M" uniqKey="Antonello M">M Antonello</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lippert, Ra" uniqKey="Lippert R">RA Lippert</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="G, R" uniqKey="G R">R G</name>
</author>
<author>
<name sortKey="D, C" uniqKey="D C">C D</name>
</author>
<author>
<name sortKey="F, S" uniqKey="F S">S F</name>
</author>
<author>
<name sortKey="Ms, W" uniqKey="Ms W">W MS</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Regnier, M" uniqKey="Regnier M">M Régnier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kaufmann, L" uniqKey="Kaufmann L">L Kaufmann</name>
</author>
<author>
<name sortKey="Rousseeuw, P" uniqKey="Rousseeuw P">P Rousseeuw</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Consortium, Hmp" uniqKey="Consortium H">HMP Consortium</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boisvert, S" uniqKey="Boisvert S">S Boisvert</name>
</author>
<author>
<name sortKey="Raymond, F" uniqKey="Raymond F">F Raymond</name>
</author>
<author>
<name sortKey="Godzaridis, E" uniqKey="Godzaridis E">É Godzaridis</name>
</author>
<author>
<name sortKey="Laviolette, F" uniqKey="Laviolette F">F Laviolette</name>
</author>
<author>
<name sortKey="Corbeil, J" uniqKey="Corbeil J">J Corbeil</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sharon, I" uniqKey="Sharon I">I Sharon</name>
</author>
<author>
<name sortKey="Morowitz, Mj" uniqKey="Morowitz M">MJ Morowitz</name>
</author>
<author>
<name sortKey="Thomas, Bc" uniqKey="Thomas B">BC Thomas</name>
</author>
<author>
<name sortKey="Costello, Ek" uniqKey="Costello E">EK Costello</name>
</author>
<author>
<name sortKey="Relman, Da" uniqKey="Relman D">DA Relman</name>
</author>
<author>
<name sortKey="Banfield, Jf" uniqKey="Banfield J">JF Banfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinh, Lv" uniqKey="Vinh L">LV Vinh</name>
</author>
<author>
<name sortKey="Lang, Tv" uniqKey="Lang T">TV Lang</name>
</author>
<author>
<name sortKey="Binh, Lt" uniqKey="Binh L">LT Binh</name>
</author>
<author>
<name sortKey="Hoai, Tv" uniqKey="Hoai T">TV Hoai</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31757198</article-id>
<article-id pub-id-type="pmc">6873667</article-id>
<article-id pub-id-type="publisher-id">2904</article-id>
<article-id pub-id-type="doi">10.1186/s12859-019-2904-4</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Qian</surname>
<given-names>Jia</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Comin</surname>
<given-names>Matteo</given-names>
</name>
<address>
<email>comin@dei.unipd.it</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1757 3470</institution-id>
<institution-id institution-id-type="GRID">grid.5608.b</institution-id>
<institution>Department of Information Engineering, University of Padova,</institution>
</institution-wrap>
Via Giovanni Gradenigo 6, Padova, Italy</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>22</day>
<month>11</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>22</day>
<month>11</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>20</volume>
<issue>Suppl 9</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</issue-sponsor>
<elocation-id>367</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>4</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>5</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Motivation</title>
<p>Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Because assembly typically produces only genome fragments, also known as contigs, it is crucial to group them into putative species for further taxonomic profiling and downstream functional analysis. Taxonomic analysis of microbial communities requires contig clustering, a process referred to as binning, which is still one of the most challenging tasks when analyzing metagenomic data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, sequencing errors, and the limitations due to binning contigs of different lengths.</p>
</sec>
<sec>
<title>Results</title>
<p>In this context we present MetaCon, a novel tool for unsupervised metagenomic contig binning based on probabilistic k-mers statistics and coverage. MetaCon uses a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species; in addition, contigs of different lengths are clustered in two separate phases. The effectiveness of MetaCon is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, MaxBin and MetaBAT.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-019-2904-4) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Metagenomics</kwd>
<kwd>Unsupervised clustering</kwd>
<kwd>K-mers statistics</kwd>
</kwd-group>
<conference xlink:href="http://bioinformatics.it/">
<conf-name>Annual Meeting of the Bioinformatics Italian Society (BITS 2018)</conf-name>
<conf-acronym>BITS 2018</conf-acronym>
<conf-loc>Turin, Italy</conf-loc>
<conf-date>27 - 29 June 2018</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="introduction">
<title>Introduction</title>
<p>Studies in microbial ecology commonly experience a bottleneck effect due to difficulties in microbial isolation and cultivation [
<xref ref-type="bibr" rid="CR1">1</xref>
]. Due to the difficulty in culturing most organisms in a laboratory, alternative methods to analyze microbial diversity are commonly used to study community structure and functionality.</p>
<p>One such method is the sequencing of the collective genomes (metagenomics) of all microorganisms in an environment [
<xref ref-type="bibr" rid="CR2">2</xref>
]. Metagenomics is the study of heterogeneous microbial samples (e.g. soil, water, human microbiome) extracted directly from the natural environment, with the primary goal of determining the taxonomic identity of the microorganisms residing in the samples. It marks a shift in focus from the study of individual microbes to the study of complex microbial communities. As already mentioned in [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR4">4</xref>
], classical genomic approaches require prior cloning and culturing before further investigation. However, not all bacteria can be cultured. The advent of metagenomics made it possible to bypass this difficulty.</p>
<p>To further investigate the taxonomic structure of microbial samples, assembled sequence fragments, also known as contigs, need to be grouped into bins that ultimately represent genomes. Contig binning serves as the key step toward taxonomic profiling and downstream functional analysis. Therefore, accurate binning of the contigs is an essential problem in metagenomic studies.</p>
<p>Grouping contigs into bins of putative species is one of the hurdles faced when analyzing metagenomic data. Typical issues include difficulty in differentiating related microorganisms, repetitive sequence regions within or across genomes, sequencing errors, strain-level variation within the same species, decreased accuracy for contigs below a size threshold, and the exclusion of low-coverage and low-abundance organisms [
<xref ref-type="bibr" rid="CR5">5</xref>
,
<xref ref-type="bibr" rid="CR6">6</xref>
].</p>
<p>Despite extensive studies, accurate binning of contigs remains challenging [
<xref ref-type="bibr" rid="CR7">7</xref>
]. Existing methods fall into two categories. One category is reference-based (supervised), that is, reference databases are needed to assign contigs or reads to meaningful taxa. The classification is based either on homology or on genomic signatures such as oligonucleotide composition patterns and taxonomic clades. Among the most important methods are Megan [
<xref ref-type="bibr" rid="CR8">8</xref>
], Kraken [
<xref ref-type="bibr" rid="CR9">9</xref>
], Clark [
<xref ref-type="bibr" rid="CR10">10</xref>
], SKraken [
<xref ref-type="bibr" rid="CR11">11</xref>
], and MetaPhlan [
<xref ref-type="bibr" rid="CR12">12</xref>
].</p>
<p>Reference-based methods require indexing a database of target genomes, e.g. the NCBI/RefSeq databases of bacterial genomes, which is then used for classification. These methods are usually very demanding, requiring large amounts of RAM and disk space. Yet, query sequences originating from the genomes of most microbes in an environmental sample lack taxonomically related sequences in existing reference databases. Most bacteria found in environmental samples are unknown and cannot be cultured and separated in the laboratory [
<xref ref-type="bibr" rid="CR13">13</xref>
]. For these reasons, when using reference-based methods the number of unassigned contigs can be very high [
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR15">15</xref>
].</p>
<p>The other category of methods is reference-free (unsupervised), where studies extract features from contigs to infer bins based on sequence composition [
<xref ref-type="bibr" rid="CR16">16</xref>
–
<xref ref-type="bibr" rid="CR18">18</xref>
], abundance [
<xref ref-type="bibr" rid="CR19">19</xref>
], or hybrids of both sequence composition and abundance [
<xref ref-type="bibr" rid="CR5">5</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
–
<xref ref-type="bibr" rid="CR22">22</xref>
]. Therefore, these approaches can be applied to bin contigs from incomplete or uncultivated genomes. Some hybrid binning tools, such as CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
], MaxBin2.0 [
<xref ref-type="bibr" rid="CR20">20</xref>
] and GroopM [
<xref ref-type="bibr" rid="CR21">21</xref>
], are designed to bin contigs based on multiple related metagenomic samples. Among these methods, GroopM [
<xref ref-type="bibr" rid="CR21">21</xref>
] stands out for its visual and interactive pipeline. On the one hand, it is flexible, allowing users to merge and split bins; on the other hand, in the absence of expert intervention, the automatic binning results of GroopM are not as satisfactory as those of CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
]. CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
] uses a Gaussian mixture model (GMM) to cluster contigs into bins. MetaBAT [
<xref ref-type="bibr" rid="CR22">22</xref>
] calculates an integrated distance for each pair of contigs and then clusters contigs iteratively with a modified K-medoids algorithm. MaxBin [
<xref ref-type="bibr" rid="CR20">20</xref>
] compares the distributions of distances between and within the same genomes.</p>
<p>The composition of DNA, in terms of its constituent
<italic>k</italic>
-mers, is known to be a feature of the genome. All the above studies are based on the assumption that the
<italic>k</italic>
-mer frequency distributions of long fragments or whole genome sequences are unique to each genome. More precisely, contig binning using k-mers composition is based on the observation that relative sequence compositions are similar across different regions of the same genome, but differ between distinct genomes.</p>
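<p>As a minimal sketch of this assumption (not the MetaCon implementation), the following Python code computes relative k-mer frequency profiles and compares them with a Euclidean distance. The function names kmer_profile and euclidean, the choice k=2, and the toy sequences are illustrative assumptions; fragments with similar composition are expected to be closer than fragments with different composition.</p>

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return relative k-mer frequencies of seq over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; windows with
    # ambiguous bases (e.g. N) are ignored by construction.
    total = sum(counts[w] for w in kmers)
    return [counts[w] / total if total else 0.0 for w in kmers]


def euclidean(u, v):
    """Euclidean distance between two equally long frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Hypothetical toy fragments: two AT-rich ones and one GC-rich one.
frag_a1 = "ATATTAATATTTAATA" * 4
frag_a2 = "TTAATATATAATTTAT" * 4
frag_b = "GCGCCGGCGCGGCCGC" * 4

print("similar composition:  ", euclidean(kmer_profile(frag_a1), kmer_profile(frag_a2)))
print("different composition:", euclidean(kmer_profile(frag_a1), kmer_profile(frag_b)))
```

<p>Real binning tools typically use longer k-mers (for example tetramers) and much longer fragments; k=2 is used here only to keep the example small.</p>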
<p>In general, current tools use simple
<italic>k</italic>
-mers counts as the signature of a genome, and these counts are normalized globally for ease of comparison. That is, all k-mers counts are normalized in the same way, irrespective of the contig or species they belong to. Moreover, when the similarity of two contigs is evaluated as the distance between the corresponding k-mers count vectors, not all k-mers contribute uniformly to the distance. In fact, because k-mers have different probabilities of appearing, the most probable k-mers can bias the distance. In this study, we take this observation into account to develop a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species. In general, the pairwise comparison of two sequences, or sets of sequences, can be performed with sophisticated similarity measures, based on k-mers statistics, derived from research in alignment-free statistics [
<xref ref-type="bibr" rid="CR23">23</xref>
–
<xref ref-type="bibr" rid="CR28">28</xref>
].</p>
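<p>To illustrate the bias described above, the sketch below centres and scales each k-mer count by its expected count under a simple independent-nucleotide model estimated from the sequence itself. This is an assumption made for illustration, in the spirit of alignment-free statistics; the exact statistic used by MetaCon is not given in this passage, and the function name standardized_kmer_counts is hypothetical.</p>

```python
from collections import Counter
from itertools import product
from math import sqrt


def standardized_kmer_counts(seq, k=2):
    """Centre and scale k-mer counts by their expectation under an
    independent-nucleotide model estimated from the sequence itself."""
    seq = seq.upper()
    n = len(seq) - k + 1                      # number of k-mer windows
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n))
    result = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        p = 1.0
        for b in word:
            p *= base_freq[b]                 # probability of the word
        expected = n * p
        # Words with zero expected count cannot occur; report 0 for them.
        result[word] = (counts[word] - expected) / sqrt(expected) if expected else 0.0
    return result


# Hypothetical toy sequence: "AC" is over-represented, "AA" never occurs.
scores = standardized_kmer_counts("ACGT" * 8)
print(scores["AC"], scores["AA"])
```

<p>With such a scaling, a k-mer that is frequent only because its nucleotides are frequent no longer dominates a distance computed on the resulting vectors.</p>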
<p>Another important aspect is that long contigs carry more information than short contigs. For this reason, long contigs, being more representative, can be effectively grouped into clusters of candidate species, whereas short contigs are often noisier.</p>
<p>In this paper, we propose MetaCon, a method for contig binning based on a new self-standardized k-mers statistic. The main contributions of MetaCon are the following: it learns the different k-mers distributions based on the k-mers probabilities in each contig; it uses the information carried by long contigs to guide the construction of clusters; and it can estimate the number of species with a simple iterative procedure. A recent independent benchmark [
<xref ref-type="bibr" rid="CR7">7</xref>
] has reported that the best binning methods are CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
] and MetaBAT [
<xref ref-type="bibr" rid="CR22">22</xref>
]. We tested MetaCon on simulated and real metagenomes and compared the accuracy of binning with CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
], MetaBAT [
<xref ref-type="bibr" rid="CR22">22</xref>
] and MaxBin [
<xref ref-type="bibr" rid="CR20">20</xref>
]. MetaCon performs better in terms of precision, recall and estimated number of species on both simulated and real datasets. The results of this study show that MetaCon can generate high-quality genomes from metagenomic datasets via an automated process, which will enhance our ability to understand complex microbial communities.</p>
</sec>
<sec id="Sec2">
<title>Materials and methods</title>
<p>In this section we present MetaCon and our contribution to the problem of metagenomic contig binning. As we have already observed, most binning tools are based on similarity measures between contigs that are built over k-mers frequency distributions.</p>
<p>However, when dealing with a similarity measure based on k-mers counts there are two major issues. The first is that k-mers might have a different probability to appear in different genomic sequences. The second is that longer contigs carry more information than short contigs, and the direct comparison of the two can be challenging.</p>
<p>The first problem has been extensively studied in the field of alignment-free measures. The second suggests that short contigs should be treated differently. MetaCon addresses these problems with a two-phase binning structure in which each phase processes one portion of the input dataset. Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
shows the processing pipeline of MetaCon. We describe the main steps of the method below, briefly explaining the reasons behind them. Each step is described in detail in the following subsections.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Overview of the MetaCon pipeline</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<sec id="Sec3">
<title>Contigs representation</title>
<p>Let us assume that we have
<italic>N</italic>
contigs in input that we need to group into bins. First, we construct the feature matrix, using the same notation as [
<xref ref-type="bibr" rid="CR5">5</xref>
], where every row corresponds to a single contig that is represented by a (
<italic>M</italic>
+
<italic>V</italic>
) feature vector where
<italic>M</italic>
is the number of features for the coverage and
<italic>V</italic>
is the number of features for the composition matrix. This feature matrix has size Nx(M+V), and the two sets of features can be computed independently as follows. Similar to CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
], the coverage matrix
<italic>Y</italic>
represents the average coverage of contigs in every data sample. More precisely, Y is a NxM matrix where
<italic>Y</italic>
<sub>
<italic>cm</italic>
</sub>
indicates the coverage of the
<italic>c</italic>
-th contig in the
<italic>m</italic>
-th sample. The composition matrix Z of size NxV represents the frequency of every
<italic>k</italic>
-mer and its reverse complement for the input contigs.</p>
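<p>As an illustration only, the composition features of a single contig could be counted as in the following Python sketch. The function names, the choice of the lexicographically smaller string as the shared representative of a k-mer and its reverse complement, and the skipping of windows with ambiguous bases are assumptions made for this example, not details taken from the MetaCon implementation.</p>

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def canonical(kmer):
    # Assumption: a k-mer and its reverse complement share one feature,
    # represented by the lexicographically smaller of the two strings.
    return min(kmer, reverse_complement(kmer))


def composition_vector(contig, k=4):
    """Count canonical k-mers in one contig; returns (feature names, counts)."""
    features = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    index = {w: i for i, w in enumerate(features)}
    counts = [0] * len(features)
    for start in range(len(contig) - k + 1):
        kmer = contig[start:start + k]
        # Skip windows containing ambiguous bases such as N (an assumption).
        if all(base in COMPLEMENT for base in kmer):
            counts[index[canonical(kmer)]] += 1
    return features, counts


features, counts = composition_vector("ACGTACGT")
print(len(features), sum(counts))
```

<p>With k = 4 this merging gives V = 136 features, since 16 of the 256 possible 4-mers are their own reverse complement.</p>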
<p>To avoid zero values, a pseudo-count is added. For the composition matrix we add one (a relatively small number, since this matrix holds k-mers counts), i.e.,
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$Z_{ij}^{\prime }=Z_{ij}+1$\end{document}</tex-math>
<mml:math id="M2">
<mml:msubsup>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
, while the coverage matrix is modified as
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$Y_{ij}^{\prime }=Y_{ij}+0.01$\end{document}</tex-math>
<mml:math id="M4">
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>0.01</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
(a negligible quantity in terms of coverage).</p>
<p>To normalize the coverage matrix, we re-scale it in two different ways. First, across the contigs:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} Y_{cm}^{\prime\prime} = \frac{Y_{cm}^{\prime}}{\sum_{c=1}^{N}Y_{cm}^{\prime}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M6">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>′′</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>This is followed by a normalization across samples, within every contig. The coverage profile matrix after this operation is denoted by Q:
<disp-formula id="Equb">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} Q_{cm}=\frac{Y_{cm}^{\prime\prime}}{\sum_{m=1}^{M}Y_{cm}^{\prime\prime}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M8">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:msub>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>′′</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msubsup>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cm</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>′′</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
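<p>The pseudo-count and the two coverage normalizations above can be sketched in a few lines of Python. This is a minimal illustration on plain nested lists; the function name <monospace>normalize_coverage</monospace> and its <monospace>pseudo</monospace> parameter are hypothetical and not part of MetaCon.</p>

```python
def normalize_coverage(Y, pseudo=0.01):
    """Y is an N x M list of lists (contigs x samples); returns Q of the same shape."""
    n, m = len(Y), len(Y[0])
    # Pseudo-count: Y' = Y + 0.01
    Yp = [[value + pseudo for value in row] for row in Y]
    # First normalization, across contigs: each sample (column) sums to 1.
    col_sums = [sum(Yp[c][j] for c in range(n)) for j in range(m)]
    Ypp = [[Yp[c][j] / col_sums[j] for j in range(m)] for c in range(n)]
    # Second normalization, across samples: each contig (row) sums to 1.
    return [[value / sum(row) for value in row] for row in Ypp]


Q = normalize_coverage([[1.0, 3.0], [3.0, 1.0]])
print(Q)
```

<p>After the second step every row of Q sums to one, so each contig is described by its relative coverage profile across the M samples.</p>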
<p>Each contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
is represented by the M coverage features
<italic>Q</italic>
<sub>
<italic>cm</italic>
</sub>
, with 1≤
<italic>m</italic>
≤
<italic>M</italic>
and 1≤
<italic>c</italic>
≤
<italic>N</italic>
. These normalizations of the coverage matrix have already been used in CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
], and other methods.</p>
<p>In this paper we are interested in building a better feature vector for the k-mer signature, which is used in the subsequent clustering. We observe that the length of contigs plays an important role in the quality of the k-mer signature. Indeed, short contigs may be poor representatives of the genome: they carry little information about it and may not capture the different distributions of k-mers as well as long contigs do. Furthermore, since the clustering method (e.g., k-medoids) starts from randomly chosen contigs as initial medoids, selecting short contigs as starting points can degrade the clustering performance. We address this issue by splitting the whole dataset into two parts, based on contig length. Long contigs are clustered in the first phase, whereas short contigs are handled in the second phase.</p>
</sec>
<sec id="Sec4">
<title>Phase 1: self-standardized k-mers statistics</title>
<p>Inspired by recent developments in the field of alignment-free statistics, we propose here a novel similarity measure based on probabilistic k-mers statistics for the comparison of two contigs. The idea is to account for the different distributions of
<italic>k</italic>
-mers counts in different contigs, and to remove the bias generated by contigs of different lengths, in a probabilistic framework with a self-standardized k-mers statistics. Note that this applies only to the long contigs; the short contigs are left unchanged.</p>
<p>Let us define a contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
, as a sequence of characters from the alphabet
<italic>Σ</italic>
={
<italic>A,C,G,T</italic>
}. Let
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
be the frequency of the
<italic>k</italic>
-mer
<italic>w</italic>
in the contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
. Given that contigs are sequenced from both strands of a genome,
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
also includes the contribution of the reverse complement of
<italic>w</italic>
. If
<italic>k</italic>
is smaller than the logarithm of the length of contigs,
<italic>k</italic>
&lt;
log
|
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
|, we can consider the variables
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
as Binomial, in line with other studies [
<xref ref-type="bibr" rid="CR29">29</xref>
,
<xref ref-type="bibr" rid="CR30">30</xref>
]. Similarly to other methods [
<xref ref-type="bibr" rid="CR22">22</xref>
], MetaCon will use
<italic>k</italic>
=4, as described in the Results section, so this approximation holds.</p>
<p>To account for the different probability of appearance of
<italic>k</italic>
-mers, we standardize the variables
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
in the following way. For the sequence
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
, let
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$p_{c}^{j}(a)$\end{document}</tex-math>
<mml:math id="M10">
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>(</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
be the probability of the symbol
<italic>a</italic>
in position
<italic>j</italic>
in
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
. If we assume that the symbols at different positions are independent and identically distributed, we can simplify
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$p_{c}^{j}$\end{document}</tex-math>
<mml:math id="M12">
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
and denote it by
<italic>p</italic>
<sub>
<italic>c</italic>
</sub>
. This i.i.d. model has been widely used in the field of pattern statistics [
<xref ref-type="bibr" rid="CR31">31</xref>
,
<xref ref-type="bibr" rid="CR32">32</xref>
]. Based on this assumption, we define the probability of a
<italic>k</italic>
-mer
<italic>w</italic>
=
<italic>w</italic>
<sub>1</sub>
<italic>w</italic>
<sub>2</sub>
...
<italic>w</italic>
<sub>
<italic>k</italic>
</sub>
to occur at a given position in the contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
as
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$P_{cw}=\prod _{i=1}^{k}{p_{c}(w_{i})}$\end{document}</tex-math>
<mml:math id="M14">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mrow>
<mml:mo>∏</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
, which again is independent of the position in
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
.</p>
<p>Now, we recall that
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
is a Binomial variable and that the k-mer
<italic>w</italic>
occurs with probability
<italic>P</italic>
<sub>
<italic>cw</italic>
</sub>
, so we can compute the mean and variance of
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
as:
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} E[X_{cw}] = \mu_{cw} = P_{cw} L(x_{c}) \end{array} $$ \end{document}</tex-math>
<mml:math id="M16">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mi>E</mml:mi>
<mml:mo>[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mi>L</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} Var(X_{cw}) = (\sigma_{cw})^{2} = P_{cw}(1 - P_{cw}) L(x_{c}) \end{array} $$ \end{document}</tex-math>
<mml:math id="M18">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mtext mathvariant="italic">Var</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>L</italic>
(
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
) is the length of the contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
. Thus, the k-mers counts
<italic>X</italic>
<sub>
<italic>cw</italic>
</sub>
can be standardized, as a z-score, as follows:
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \widetilde{X}_{cw} = \frac{{X_{cw}} - \mu_{cw}}{\sigma_{cw}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M20">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:msub>
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>As already observed, the frequency of
<italic>k</italic>
-mers in different genomes can vary greatly. Similarly, it is difficult to estimate the probability
<italic>P</italic>
<sub>
<italic>cw</italic>
</sub>
, as it does not follow the same model for different genomes. Thus we need to estimate
<italic>P</italic>
<sub>
<italic>cw</italic>
</sub>
directly from the contig. We define
<italic>n</italic>
<sub>
<italic>c</italic>
</sub>
(
<italic>a</italic>
), with
<italic>a</italic>
∈{
<italic>G,T,A,C</italic>
}, as the number of occurrences of the nucleotide
<italic>a</italic>
in the contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
. Then, we can estimate the probability of the symbol
<italic>a</italic>
in the contig
<italic>x</italic>
<sub>
<italic>c</italic>
</sub>
as,
<disp-formula id="Equc">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} p_{c}(a)=\frac{n_{c}(a)} {L(x_{c})} \end{array} $$ \end{document}</tex-math>
<mml:math id="M22">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
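<p>To make Eqs. (1)-(3) concrete, the following Python sketch standardizes the k-mer counts of one contig under the i.i.d. nucleotide model, with the nucleotide probabilities estimated from the contig itself. It is a sketch under stated assumptions: the function name and the dictionary input are hypothetical, and setting the score to zero when the variance is zero (a nucleotide absent from the contig) is a choice made here that the text does not specify.</p>

```python
from math import prod, sqrt


def standardized_counts(contig, kmer_counts):
    """kmer_counts maps each k-mer w to its observed count X_cw in the contig."""
    length = len(contig)                                          # L(x_c)
    p = {a: contig.count(a) / length for a in "ACGT"}             # p_c(a) = n_c(a) / L(x_c)
    scores = {}
    for w, x in kmer_counts.items():
        P = prod(p[base] for base in w)                           # P_cw, product of p_c(w_i)
        mu = P * length                                           # Eq. (1)
        sigma = sqrt(P * (1 - P) * length)                        # square root of Eq. (2)
        # Assumption: a zero variance gives a score of 0 instead of an error.
        scores[w] = (x - mu) / sigma if sigma > 0 else 0.0        # Eq. (3)
    return scores


contig = "ACGT" * 25
print(standardized_counts(contig, {"ACGT": 25, "AAAA": 0}))
```

<p>In this toy contig all four nucleotides are equally frequent, so every 4-mer has the same expected count; the repeated word ACGT receives a large positive score and the absent word AAAA a negative one.</p>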
<p>To summarize, we start from the raw k-mers counts directly obtained from matrix
<italic>Z</italic>
<sup>′</sup>
; for each contig we compute the probabilities
<italic>P</italic>
<sub>
<italic>cw</italic>
</sub>
and build a probabilistic k-mers statistics
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\widetilde {X}_{cw}$\end{document}</tex-math>
<mml:math id="M24">
<mml:msub>
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
by using formula (3). Similar to the normalization applied to the coverage features, the probabilistic k-mers statistics
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\widetilde {X}_{cw}$\end{document}</tex-math>
<mml:math id="M26">
<mml:msub>
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="12859_2019_2904_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
is column-wise normalized (normalization across contigs), as H:
<disp-formula id="Equd">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} H_{cw} = \frac{\widetilde{X}_{cw}}{\sum_{c} {\widetilde{X}_{cw}} } \end{array} $$ \end{document}</tex-math>
<mml:math id="M28">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">cw</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Finally, the feature matrix F of long contigs is assembled as
<italic>F</italic>
=[
<italic>Q</italic>
<italic>H</italic>
], as the combination of the coverage profile
<italic>Q</italic>
and probabilistic k-mers profile H. Then, the relatedness of a pair of contigs can be evaluated by the L2 distance between the corresponding feature vectors. Here we use the k-medoids clustering method [
<xref ref-type="bibr" rid="CR33">33</xref>
], a variant of k-means, with the feature matrix F as input.</p>
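<p>A minimal Python sketch of the last two steps is given below: the column-wise normalization that produces H, the concatenation F = [Q H], and the L2 distance between two rows of F. The helper names are hypothetical, and the sketch assumes that no column of standardized scores sums to zero, a case the formula for H leaves undefined.</p>

```python
from math import sqrt


def column_normalize(X):
    """Divide every entry by the sum of its column (normalization across contigs)."""
    n, v = len(X), len(X[0])
    # Assumption: every column sum is non-zero.
    sums = [sum(X[c][w] for c in range(n)) for w in range(v)]
    return [[X[c][w] / sums[w] for w in range(v)] for c in range(n)]


def assemble_features(Q, H):
    """Concatenate coverage and k-mer profiles row by row: F = [Q H]."""
    return [q_row + h_row for q_row, h_row in zip(Q, H)]


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


H = column_normalize([[1.0, 2.0], [3.0, 6.0]])
F = assemble_features([[0.5, 0.5], [0.2, 0.8]], H)
print(l2_distance(F[0], F[1]))
```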
</sec>
<sec id="Sec5">
<title>Phase 2: dealing with short contigs</title>
<p>In the first phase we filter out short contigs and build clusters with the k-medoids algorithm, using the feature matrix F described above. We process only long contigs in the first phase because they are more informative in terms of k-mers statistics. We believe that the underlying structure of every species can be revealed in the first phase once the short contigs are removed from the dataset. In fact, the clusters produced in the first phase have high precision (see the Results section), because they are more distinguishable and less noisy. These clusters are used as a basis for the assignment of short contigs in the second phase.</p>
<p>The second subset contains the short contigs, which we assign to the clusters produced by the first phase according to the shortest L1 distance. The profile matrix is the concatenation of the composition and coverage matrices of the short contigs. Note that the composition matrix for short contigs is not normalized. The L1 distance measures the dissimilarity between two multi-dimensional data points as the sum of the absolute differences of their coordinates. In our case, we observed that L1 works better than L2 (Euclidean distance) in the second phase. We think that the L1 distance may amplify the differences better than L2 in the second phase, where short contigs are less representative.</p>
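<p>The second-phase assignment can be sketched as a nearest-representative search under the L1 distance. In this illustrative Python fragment each cluster from the first phase is assumed to be summarized by a single profile (for instance its medoid); the function names are hypothetical.</p>

```python
def l1_distance(u, v):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))


def assign_short_contigs(short_profiles, cluster_profiles):
    """Return, for every short contig, the index of the closest cluster profile."""
    labels = []
    for profile in short_profiles:
        distances = [l1_distance(profile, ref) for ref in cluster_profiles]
        labels.append(distances.index(min(distances)))
    return labels


clusters = [[0.0, 0.0], [10.0, 10.0]]
shorts = [[1.0, 2.0], [9.0, 8.0], [6.0, 5.0]]
print(assign_short_contigs(shorts, clusters))
```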
<p>An overview of MetaCon is presented in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. Here we summarize the overall procedure.
<list list-type="order">
<list-item>
<p>Compute the composition and coverage matrices.</p>
</list-item>
<list-item>
<p>Normalize the coverage matrix.</p>
</list-item>
<list-item>
<p>Estimate the number of clusters: C.</p>
</list-item>
<list-item>
<p>Split the dataset into two subsets: long and short contigs.</p>
</list-item>
<list-item>
<p>First phase: Compute the probabilistic k-mers signature and normalize the composition matrix of long contigs.</p>
</list-item>
<list-item>
<p>Cluster the long contigs by k-medoids:
<list list-type="alpha-lower">
<list-item>
<p>Initialization: randomly select C contigs as the medoids.</p>
</list-item>
<list-item>
<p>Assignment step: Associate each contig to the closest medoid.</p>
</list-item>
<list-item>
<p>Update step: For each medoid
<italic>m</italic>
and each contig
<italic>c</italic>
associated to
<italic>m</italic>
swap
<italic>m</italic>
and
<italic>c</italic>
and compute the total cost of the new configuration, based on the average dissimilarity of
<italic>c</italic>
to all contigs associated to
<italic>m</italic>
. Select the medoid
<italic>c</italic>
with the lowest configuration cost.</p>
</list-item>
<list-item>
<p>Repeat steps b and c until there is no change in the assignments.</p>
</list-item>
</list>
</p>
</list-item>
<list-item>
<p>Second phase: Assign the short contigs to the closest centroid by L1 distance.</p>
</list-item>
</list>
</p>
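<p>The k-medoids loop of step 6 can be sketched in Python as follows. This is not the implementation used by MetaCon: it uses the simpler alternating variant, in which each cluster re-elects as medoid the member with the smallest total distance to the other members, instead of evaluating every swap. The <monospace>init</monospace> and <monospace>seed</monospace> parameters are hypothetical additions for reproducibility.</p>

```python
import random
from math import sqrt


def l2(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign(F, medoids):
    # Assignment step: index of the closest medoid for every contig.
    return [min(range(len(medoids)), key=lambda j: l2(x, F[medoids[j]])) for x in F]


def k_medoids(F, C, init=None, seed=0, max_iter=100):
    """Cluster the rows of F into C groups; returns (medoid indices, labels)."""
    rng = random.Random(seed)
    # Initialization: C random contigs, unless explicit indices are given.
    medoids = list(init) if init is not None else rng.sample(range(len(F)), C)
    for _ in range(max_iter):
        labels = assign(F, medoids)
        new_medoids = []
        for j in range(C):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                # Keep the old medoid if a cluster lost all its members.
                new_medoids.append(medoids[j])
                continue
            # Update step: member with the lowest total distance to its cluster.
            best = min(members, key=lambda i: sum(l2(F[i], F[o]) for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(F, medoids)


F = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(k_medoids(F, 2, init=[0, 3]))
```

<p>Because the result depends on the initial medoids, several random restarts are commonly used in practice; the sketch runs a single one.</p>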
</sec>
<sec id="Sec6">
<title>Estimating the number of species</title>
<p>Estimating the real number of clusters is one of the most challenging problems in clustering. The difficulty is primarily due to the absence of prior knowledge of the data: in metagenomics, the real number of species in the dataset is not known. Moreover, there is no general criterion that assesses clusters well across different datasets, in particular when the number of clusters is large and the data are high-dimensional. Although some methods are tailored to datasets with a known distribution, here we use a simple and intuitive method to estimate the number of species. We run k-means repeatedly, starting from a small number of clusters and gradually increasing it until a stopping criterion is met: the procedure stops when the non-empty clusters are fewer than 80% of the candidate number of clusters in the corresponding iteration. Because this iterative procedure can be computationally demanding, we use an efficient library implementation [
<xref ref-type="bibr" rid="CR34">34</xref>
].</p>
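<p>The stopping rule can be sketched in Python as below. The clustering routine is passed in as a function, since the text relies on an external k-means library; the names <monospace>estimate_num_clusters</monospace> and <monospace>cluster_fn</monospace>, the starting value, the step size and the upper bound are all assumptions, as is the choice of returning the number of non-empty clusters as the estimate.</p>

```python
def estimate_num_clusters(F, cluster_fn, start=2, step=1, ratio=0.8, max_clusters=500):
    """Increase the candidate number of clusters C until too many clusters are empty.

    cluster_fn(F, C) must return one label per row of F; labels that are never
    used correspond to empty clusters.
    """
    non_empty = 0
    for C in range(start, max_clusters + 1, step):
        labels = cluster_fn(F, C)
        non_empty = len(set(labels))
        # Stop when the non-empty clusters are fewer than 80% of the candidate C.
        if ratio * C > non_empty:
            return non_empty
    return non_empty


def toy_clustering(F, C):
    # Stand-in for k-means: pretends the data never supports more than 3 groups.
    return [i % min(C, 3) for i in range(len(F))]


print(estimate_num_clusters(list(range(12)), toy_clustering))
```

<p>In a real run <monospace>cluster_fn</monospace> would wrap a k-means implementation that is allowed to leave clusters empty; the toy function only serves to exercise the loop.</p>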
</sec>
</sec>
<sec id="Sec7">
<title>Results and discussion</title>
<p>To validate our contribution, we compare it with the well-known methods CONCOCT, MaxBin 2.0 and MetaBat. In particular, CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
] and MetaBat [
<xref ref-type="bibr" rid="CR22">22</xref>
] have been reported to be the best performing methods in a recent independent benchmark [
<xref ref-type="bibr" rid="CR7">7</xref>
]. All of these tools take the composition and coverage matrices as input, as MetaCon does. MaxBin 2.0 [
<xref ref-type="bibr" rid="CR20">20</xref>
] estimates the probability that a contig belongs to a bin based on expectation-maximization (EM). MetaBat [
<xref ref-type="bibr" rid="CR22">22</xref>
] starts from one bin, and gradually assigns the contigs to that bin until the centroid does not change, repeatedly for several bins until no contigs are left. CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
] applies PCA (principal component analysis) to the feature matrix (composed of the coverage and composition matrices) for dimensionality reduction, and afterwards fits a Gaussian mixture model.</p>
<sec id="Sec8">
<title>Synthetic and real datasets</title>
<p>Before discussing the results, we briefly introduce the datasets. In this paper, we test the methods on both synthetic and real metagenomic datasets. A complete description of the dataset construction can be found in [
<xref ref-type="bibr" rid="CR5">5</xref>
]; here, for completeness, we report a brief summary. In CONCOCT [
<xref ref-type="bibr" rid="CR5">5</xref>
], the authors simulate two mock microbial communities in order to test performance, called the ’Strain’ and ’Species’ datasets. Both of these synthetic datasets are built on 16S rRNA samples from the Human Microbiome Project (HMP, [
<xref ref-type="bibr" rid="CR35">35</xref>
]). The samples were denoised, and OTUs were generated using a 3% sequence-difference threshold to approximate species. The contigs were assembled from the reads in the samples, and the reads were subsequently mapped back onto the contigs to obtain the coverage information.</p>
<p>For simulated and real data, a co-assembly of reads from all samples was performed using Ray [
<xref ref-type="bibr" rid="CR36">36</xref>
]. Ray was used to generate the co-assembled contigs because it is able to handle large metagenomic datasets. Contigs were cut up into non-overlapping fragments of 10 kilobases in order to mitigate the effect of local assembly errors (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1 in supplementary material reports the contig length distribution).</p>
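<p>The fragmentation step described above can be illustrated with a minimal sketch. It is not the code used to build the datasets: the function name, the fragment naming scheme and the handling of a trailing piece shorter than 10 kilobases (kept here as its own fragment) are assumptions made for illustration only.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Iterator, Tuple

FRAGMENT_LENGTH = 10_000  # 10 kilobases, the fragment size stated in the text


def cut_contig(name: str, sequence: str,
               fragment_length: int = FRAGMENT_LENGTH) -> Iterator[Tuple[str, str]]:
    """Yield (fragment_name, fragment_sequence) pairs that tile the contig without overlap."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    for index, start in enumerate(range(0, len(sequence), fragment_length)):
        # Assumption: a trailing piece shorter than fragment_length is kept as is.
        yield f"{name}.{index}", sequence[start:start + fragment_length]


if __name__ == "__main__":
    # Hypothetical contig of 25,000 bases.
    fragments = list(cut_contig("contig_1", "ACGT" * 6250))
    print([(frag_name, len(frag_seq)) for frag_name, frag_seq in fragments])
```
]]></preformat>
<p>Because the fragments do not overlap, concatenating them in order gives back the original contig, so no sequence is counted twice when coverage and composition are computed.</p>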
<p>Specifically, the ’Strain’ dataset contains 9417 contigs co-assembled from 64 samples, covering 20 organisms in total. The simulated ’Strain’ community is composed of five different
<italic>Escherichia coli</italic>
strains, five different
<italic>Bacteroides</italic>
species, five different
<italic>gut</italic>
bacteria and the rest from
<italic>Clostridium</italic>
.</p>
<p>The ’Species’ dataset has 101 different species and includes 37628 contigs, co-assembled from 96 samples. For the ’Species’ dataset, OTUs are removed when their total count across samples is less than 20. This dataset aims at testing the ability to discriminate at the species level. The complete information for the datasets can be found in the supplementary material.</p>
<p>’Sharon’ [
<xref ref-type="bibr" rid="CR37">37</xref>
] is a real dataset generated from microbiome samples of premature infants. It contains 18 samples; because the true species labels are unknown, we used TAXAassign [
<xref ref-type="bibr" rid="CR38">38</xref>
] to annotate the contigs. After filtering out the contigs with ambiguous labels at the species level, the dataset comprises 7 species and 2599 contigs.</p>
</sec>
<sec id="Sec9">
<title>Evaluation criteria</title>
<p>Precision and recall are commonly used to compare the performance of the binning algorithms under assessment. Precision measures the ability of an approach to build clusters composed of contigs from the same species. Recall, on the other hand, measures the ability to gather all the contigs of a given species. In other words, precision tests correctness and recall tests completeness. Therefore, when evaluating a binning method one should take both aspects into account in order to obtain a comprehensive evaluation.</p>
<p>Let
<italic>n</italic>
be the number of species in a metagenomic dataset, and
<italic>C</italic>
be the number of clusters returned by the algorithm. Let
<italic>A</italic>
<sub>
<italic>ij</italic>
</sub>
be the number of contigs from species
<italic>j</italic>
assigned to cluster
<italic>i</italic>
. Following the definitions in [
<xref ref-type="bibr" rid="CR39">39</xref>
], precision is computed by taking, for every cluster, the number of contigs of its most represented species, summing these counts over all clusters, and dividing by the total number of contigs. Recall is computed by taking, for every species, the number of its contigs in the cluster that contains most of them, summing these counts over all species, and again dividing by the total number of contigs.
<disp-formula id="Eque">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \text{Precision}=\frac{\sum_{i=1}^{C} \max_{j} A_{ij}}{\sum_{i=1}^{C}\sum_{j=1}^{n} A_{ij}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M30">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mtext mathvariant="italic">Precision</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Eque.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="Equf">
<alternatives>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} \text{Recall}=\frac{\sum_{j=1}^{n} \max_{i} A_{ij}}{\sum_{i=1}^{C}\sum_{j=1}^{n} A_{ij}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M32">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mtext mathvariant="italic">Recall</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3"></mml:mtd>
<mml:mtd>
<mml:mtext></mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_2904_Article_Equf.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
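<p>As an illustration of the two definitions, the following minimal sketch computes precision and recall from a contingency matrix whose entry in row i and column j is the number of contigs of species j assigned to cluster i. The function name and the toy matrix are hypothetical and are not taken from the MetaCon software.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence, Tuple


def precision_recall(A: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (precision, recall) for A[i][j] = contigs of species j in cluster i."""
    total = sum(sum(row) for row in A)
    if total == 0:
        raise ValueError("the contingency matrix contains no contigs")
    n_species = len(A[0])
    # Precision: for each cluster, count the contigs of its dominant species.
    precision = sum(max(row) for row in A) / total
    # Recall: for each species, count its contigs in the cluster holding most of them.
    recall = sum(max(row[j] for row in A) for j in range(n_species)) / total
    return precision, recall


if __name__ == "__main__":
    # Toy example: species 0 is split over two clusters, species 1 is kept together.
    A = [[5, 0],
         [4, 1],
         [0, 10]]
    print(precision_recall(A))
```
]]></preformat>
<p>In the toy matrix, splitting species 0 over two clusters leaves precision high (19 of 20 contigs) but lowers recall (15 of 20 contigs), which is why the two measures have to be read together.</p>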
</sec>
<sec id="Sec10">
<title>Results on synthetic and real datasets</title>
<p>In the first experiment, we assess the ability of MetaCon to predict the number of clusters. The average results are reported in Table 
<xref rid="Tab1" ref-type="table">1</xref>
. CONCOCT requires a maximal number of clusters as input, whereas the other methods do not. In this first experiment, the number of clusters estimated by MetaCon is the closest, or tied for the closest, to the real number of species on every dataset.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Estimated number of clusters for different methods (best results are in bold)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Dataset</th>
<th align="left">Real value</th>
<th align="left">CONCOCT</th>
<th align="left">MaxBin</th>
<th align="left">MetaCon</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Strain dataset</td>
<td align="left">20</td>
<td align="left">
<bold>21</bold>
</td>
<td align="left">17</td>
<td align="left">
<bold>21</bold>
</td>
</tr>
<tr>
<td align="left">Species dataset</td>
<td align="left">101</td>
<td align="left">84</td>
<td align="left">114</td>
<td align="left">
<bold>106</bold>
</td>
</tr>
<tr>
<td align="left">Sharon dataset</td>
<td align="left">7</td>
<td align="left">
<bold>8</bold>
</td>
<td align="left">5</td>
<td align="left">
<bold>6</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
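<p>The comparison in Table 1 can be made explicit by ranking the methods by the absolute deviation of their estimate from the real value. The sketch below only restates the numbers of Table 1; the function name and the choice of absolute deviation as the criterion are assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List

# Values copied from Table 1.
REAL: Dict[str, int] = {"Strain": 20, "Species": 101, "Sharon": 7}
ESTIMATED: Dict[str, Dict[str, int]] = {
    "CONCOCT": {"Strain": 21, "Species": 84, "Sharon": 8},
    "MaxBin": {"Strain": 17, "Species": 114, "Sharon": 5},
    "MetaCon": {"Strain": 21, "Species": 106, "Sharon": 6},
}


def closest_methods(dataset: str) -> List[str]:
    """Return the methods whose estimate has the smallest absolute deviation."""
    deviations = {
        method: abs(values[dataset] - REAL[dataset])
        for method, values in ESTIMATED.items()
    }
    best = min(deviations.values())
    return sorted(method for method, dev in deviations.items() if dev == best)


if __name__ == "__main__":
    for dataset in REAL:
        print(dataset, closest_methods(dataset))
```
]]></preformat>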
<p>In the next series of tests we evaluated the performance of MetaCon on the datasets against the other tools. MetaCon outperforms the other methods in terms of precision and recall, as shown in Figs. 
<xref rid="Fig2" ref-type="fig">2</xref>
,
<xref rid="Fig3" ref-type="fig">3</xref>
and
<xref rid="Fig4" ref-type="fig">4</xref>
. Its precision and recall are close to or above 95% for both simulated and real data. For the ’Strain’ dataset (Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
), the precision of MetaCon is about 97.5
<italic>%</italic>
, which is better than that of the other three methods; the recall is 95.8
<italic>%</italic>
, higher than that of MaxBin and MetaBat and almost identical to that of CONCOCT. For the ’Species’ dataset, shown in Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
, binning the contigs is challenging because the number of species is large; nevertheless, MetaCon reaches 99.3
<italic>%</italic>
in terms of precision and 94.6
<italic>%</italic>
for the recall. Again, the comparison with the other tools yields an outcome similar to that for the ’Strain’ dataset. For the real dataset ’Sharon’ (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
), the results are in line with those of the synthetic datasets: MetaCon achieves higher precision and recall than the other tools. The only notable difference is that on this dataset MetaBat has a precision similar to that of MetaCon, but again a lower recall.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Comparison of precision and recall for
<bold>Strain</bold>
dataset</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig2_HTML" id="MO2"></graphic>
</fig>
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Comparison of precision and recall for
<bold>Species</bold>
dataset</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig3_HTML" id="MO3"></graphic>
</fig>
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Comparison of precision and recall for
<bold>Sharon</bold>
dataset</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>Additionally, we evaluate the quality of the bins generated by the different methods for the ’Strain’ dataset. To evaluate the contamination and completeness of the bins, we first filtered out the bins whose precision is lower than 80%; the remaining bins are reported in Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a, where the different shades of gray indicate different levels of recall. In Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
b, we report the converse procedure, in which we assess the precision of the bins after filtering out those with recall lower than 80%. For example, in Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a the number of clusters with precision greater than 80% and recall greater than 95% is 16 for MetaCon, 11 for CONCOCT and 4 for MaxBin. MetaCon outperforms the other methods in two respects. First, MetaCon has more bins left after screening in both Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
a and b. Second, the bins produced by MetaCon mostly lie in the high range of precision and recall. We think that the primary reason for the good performance of MetaCon is that the first phase builds high-quality clusters, which may better represent the corresponding species and capture their distinguishing traits. In addition, k-medoids may reduce the negative influence of outliers, since it represents each cluster by a medoid rather than by the mean, and it probably further consolidates the structure of the clusters.
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>The quality of bins generated by different methods.
<bold>a</bold>
After filtering out the bins whose precision is lower than 80%, we compare, for the different methods, the number of bins falling in each recall range. The values marked in white indicate the number of bins in the corresponding recall range. Thin stripes represent the absence of bins.
<bold>b</bold>
After filtering out the bins whose recall is lower than 80%, we compare, for the different methods, the number of bins falling in each precision range. The values marked in white indicate the number of bins in the corresponding precision range. Thin stripes represent the absence of bins</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
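<p>The screening procedure can be sketched as follows. The text does not spell out how precision and recall are computed for a single bin, so this sketch assumes the usual per-bin reading: the precision of a bin is the fraction of its contigs that belong to its dominant species, and its recall is the fraction of that species' contigs that the bin contains. Function names and the toy matrix are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple


def bin_quality(A: Sequence[Sequence[int]]) -> List[Tuple[float, float]]:
    """Return (precision, recall) of every non-empty bin; A[i][j] = contigs of species j in bin i."""
    n_species = len(A[0])
    species_totals = [sum(row[j] for row in A) for j in range(n_species)]
    quality = []
    for row in A:
        size = sum(row)
        if size == 0:
            continue  # an empty bin has no defined precision
        dominant = max(range(n_species), key=lambda j: row[j])
        quality.append((row[dominant] / size, row[dominant] / species_totals[dominant]))
    return quality


def count_bins(quality: Sequence[Tuple[float, float]],
               min_precision: float = 0.80, min_recall: float = 0.95) -> int:
    """Count the bins whose precision and recall both exceed the given thresholds."""
    return sum(1 for p, r in quality if p > min_precision and r > min_recall)


if __name__ == "__main__":
    A = [[98, 1],
         [2, 60],
         [0, 39]]
    quality = bin_quality(A)
    print(count_bins(quality))                  # bins with precision > 80% and recall > 95%
    print(count_bins(quality, min_recall=0.5))  # same precision filter, looser recall range
```
]]></preformat>
<p>Counting the bins for several recall thresholds after a fixed precision filter gives the kind of breakdown shown in Fig. 5a; swapping the roles of the two measures gives the breakdown of Fig. 5b.</p>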
</sec>
<sec id="Sec11">
<title>Parameters: k-mers size</title>
<p>In this section, we discuss how to choose the parameter
<italic>k</italic>
for MetaCon and show the results under different conditions. The selection of the k-mer size is critical when we build our probabilistic k-mers signature. If k is too small (k=2), the composition matrix has only 4<sup>2</sup> = 16 features, which yields a less representative and less informative feature matrix that is not sufficient to differentiate between contigs from diverse species, especially when some of them are closely related.</p>
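<p>To make the dependence of the number of features on k concrete, the sketch below builds a plain k-mer frequency vector for a sequence. It is a simplification: it uses raw relative frequencies rather than the probabilistic k-mers statistics of MetaCon, it does not merge reverse complements, and the function name is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int) -> Dict[str, float]:
    """Return the relative frequency of each of the 4**k possible k-mers over A, C, G, T."""
    counts = {"".join(letters): 0 for letters in product("ACGT", repeat=k)}
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: count / total for kmer, count in counts.items()}


if __name__ == "__main__":
    for k in (2, 4, 6):
        print(k, len(kmer_frequencies("ACGTACGTTGCA", k)))  # number of features is 4**k
```
]]></preformat>
<p>The number of features grows as 4<sup>k</sup>, so k=2 gives 16 features and k=4 gives 256, while larger values quickly produce sparse vectors for short contigs and increase computing time.</p>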
<p>With this series of experiments we evaluate the influence of the k-mer size on MetaCon over the different datasets. The results for precision and recall are reported in Figs. 
<xref rid="Fig6" ref-type="fig">6</xref>
and
<xref rid="Fig7" ref-type="fig">7</xref>
. Note that the results are obtained with the correct number of clusters given as input, since here we only want to compare the various choices of k-mer size. For the ’Strain’ dataset, the precision increases from 93% to 97% when k varies from 2 to 6, with identical precision for k equal to 4 and 6; when k equals 8 the precision decreases. The recall on the ’Strain’ dataset follows a trend similar to that of the precision. For the ’Species’ dataset, the precision rises from 95% to 99% when k increases from 2 to 6; the value at k equal to 6 is slightly better than that at k equal to 4. On the ’Sharon’ dataset, precision and recall remain constant as k varies; since the number of species is small, less information is probably required to cluster the contigs than for the other two datasets, which contain more species. Overall, k equal to 4 is a good choice considering precision, recall and computing time. A similar value is also used in MetaBat [
<xref ref-type="bibr" rid="CR22">22</xref>
].
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>MetaCon precision for different datasets by varying k-mers size</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig6_HTML" id="MO6"></graphic>
</fig>
<fig id="Fig7">
<label>Fig. 7</label>
<caption>
<p>MetaCon recall for different datasets by varying k-mers size</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig7_HTML" id="MO7"></graphic>
</fig>
</p>
</sec>
<sec id="Sec12">
<title>The importance of contig length distribution</title>
<p>Another factor we want to address here is the length of the contigs. Recall that in the first phase of MetaCon we process only long contigs, and in the second phase we assign the short contigs. We evaluate the impact of this approach by showing how precision and recall vary across the two phases, and by comparing them with the results obtained when all contigs are used at once. The results of these experiments are reported in Tables 
<xref rid="Tab2" ref-type="table">2</xref>
and
<xref rid="Tab3" ref-type="table">3</xref>
. If the long and short contigs are not processed separately in two phases, the precision obtained by MetaCon is 93.79
<italic>%</italic>
and 97.23
<italic>%</italic>
for the ’Strain’ and ’Species’ datasets, respectively. Processing the contigs separately improves the precision: for the ’Strain’ dataset, it increases from 93.79
<italic>%</italic>
to 97.46
<italic>%</italic>
, and for ’Species’ it improves from 97.23
<italic>%</italic>
to 99.56
<italic>%</italic>
. A similar behavior is observed for the recall. For the ’Strain’ dataset the recall increases slightly from 95.23
<italic>%</italic>
to 95.78
<italic>%</italic>
, and for the ’Species’ dataset it improves from 90.95
<italic>%</italic>
to 95.04
<italic>%</italic>
.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Precision of MetaCon after the different phases, compared with the precision considering all contigs at once</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Dataset</th>
<th align="left">First-phase</th>
<th align="left">Second-phase</th>
<th align="left">All contigs</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Strain</td>
<td align="left">98.70%</td>
<td align="left">97.46%</td>
<td align="left">93.79%</td>
</tr>
<tr>
<td align="left">Species</td>
<td align="left">99.88%</td>
<td align="left">99.56%</td>
<td align="left">97.23%</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Recall of MetaCon after the different phases, compared with the recall considering all contigs at once</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Dataset</th>
<th align="left">First-phase</th>
<th align="left">Second-phase</th>
<th align="left">All contigs</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Strain</td>
<td align="left">75.05%</td>
<td align="left">95.78%</td>
<td align="left">95.23%</td>
</tr>
<tr>
<td align="left">Species</td>
<td align="left">80.86%</td>
<td align="left">95.04%</td>
<td align="left">90.95%</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>In order to choose a good threshold to split short and long contigs, we experimented with different values (see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2 in the supplementary material). Empirically, we found that a good choice is to have about 20% of the contigs labelled as short. Based on these results we selected 2000 bp as a good compromise.</p>
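<p>A minimal sketch of the length-based split is given below. It assumes that a contig whose length equals the threshold counts as long, which the text does not specify; the function names and the toy lengths are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple


def split_by_length(lengths: Dict[str, int],
                    threshold: int = 2000) -> Tuple[List[str], List[str]]:
    """Return (long_names, short_names); contigs of exactly `threshold` bp count as long."""
    long_names = [name for name, length in lengths.items() if length >= threshold]
    short_names = [name for name, length in lengths.items() if length < threshold]
    return long_names, short_names


def short_fraction(lengths: Dict[str, int], threshold: int) -> float:
    """Fraction of contigs labelled as short for a candidate threshold."""
    if not lengths:
        raise ValueError("no contigs given")
    _, short_names = split_by_length(lengths, threshold)
    return len(short_names) / len(lengths)


if __name__ == "__main__":
    lengths = {"a": 500, "b": 1999, "c": 2000, "d": 12000, "e": 3500}
    for candidate in (1000, 2000, 4000):
        print(candidate, short_fraction(lengths, candidate))
```
]]></preformat>
<p>Scanning a few candidate thresholds in this way shows which value labels roughly the desired share of contigs as short for a given length distribution.</p>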
</sec>
<sec id="Sec13">
<title>Assignment of short contigs: L1 vs L2 distance</title>
<p>In the second phase short contigs are assigned to the closest centroid by L1 distance. Here, we evaluate the effect of L1 distance in comparison with L2 distance. Figure 
<xref rid="Fig8" ref-type="fig">8</xref>
reports the precision of MetaCon on each dataset when the L1 or the L2 distance is used in the second phase. For this stage, the L1 distance outperforms L2 on all of the datasets. In particular, on the ’Strain’ dataset L1 raises the precision from 89.56
<italic>%</italic>
to 97.46
<italic>%</italic>
; on the ’Sharon’ dataset, the precision increases from 79.11
<italic>%</italic>
to 97.38
<italic>%</italic>
by using L1. We think that, when assigning the short contigs to the closest cluster centroid, L1 reveals its strength by amplifying the differences between contigs.
<fig id="Fig8">
<label>Fig. 8</label>
<caption>
<p>Comparison of L1 and L2 distance in the second-phase of MetaCon</p>
</caption>
<graphic xlink:href="12859_2019_2904_Fig8_HTML" id="MO8"></graphic>
</fig>
</p>
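<p>The assignment rule of the second phase can be sketched as a nearest-centroid search in which the distance function is a parameter. The sketch is not the MetaCon implementation: the function names are hypothetical, and the two-dimensional vectors merely stand in for the real feature vectors.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l1_distance(u: Vector, v: Vector) -> float:
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign_to_centroid(vector: Vector, centroids: Sequence[Vector],
                       distance: Callable[[Vector, Vector], float] = l1_distance) -> int:
    """Return the index of the centroid closest to `vector` under `distance`."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    return min(range(len(centroids)), key=lambda i: distance(vector, centroids[i]))


if __name__ == "__main__":
    contig = (0.0, 0.0)
    centroids = [(3.0, 0.0), (2.0, 2.0)]
    print(assign_to_centroid(contig, centroids, l1_distance))  # L1: 3 vs 4
    print(assign_to_centroid(contig, centroids, l2_distance))  # L2: 3 vs about 2.83
```
]]></preformat>
<p>In this toy case the two distances pick different centroids, which shows that the choice of distance can change the assignment of a short contig even when the centroids are fixed.</p>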
</sec>
</sec>
<sec id="Sec14" sec-type="conclusion">
<title>Conclusion</title>
<p>Binning metagenomic contigs remains a crucial step in metagenomic analysis. In this work we presented MetaCon, an unsupervised approach for metagenomic binning based on probabilistic k-mers statistics and coverage. Instead of processing the whole dataset at once, as most methods do, our approach splits the input into long and short contigs and processes them in two separate phases. We compared the binning performance on synthetic and real metagenomic datasets against other state-of-the-art binning algorithms, showing that MetaCon achieves good performance in terms of precision, recall and estimation of the number of species.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec15">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12859_2019_2904_MOESM1_ESM.pdf">
<label>Additional file 1</label>
<caption>
<p>Supplementary Material. (PDF 205 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>GMM</term>
<def>
<p>Gaussian mixture model</p>
</def>
</def-item>
<def-item>
<term>HMP</term>
<def>
<p>Human microbiome project</p>
</def>
</def-item>
<def-item>
<term>OTU</term>
<def>
<p>Operational taxonomic unit</p>
</def>
</def-item>
<def-item>
<term>PCA</term>
<def>
<p>Principal component analysis</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Authors’ information</bold>
</p>
<p>Department of Information Engineering, University of Padova, via Gradenigo 6/A, 35131 Padova, Italy</p>
<p>Email addresses: JQ (jia.qian@studenti.unipd.it), MC (comin@dei.unipd.it).</p>
</fn>
</fn-group>
<ack>
<p>Not applicable.</p>
<sec id="d29e2244">
<title>Funding</title>
<p>Publication of this article did not receive sponsorship.</p>
</sec>
<sec id="d29e2249" sec-type="data-availability">
<title>Availability of data and materials</title>
<p>The software is freely available for academic use at:
<ext-link ext-link-type="uri" xlink:href="http://www.dei.unipd.it/%7Eciompin/main/metacon.html">http://www.dei.unipd.it/%7Eciompin/main/metacon.html</ext-link>
</p>
</sec>
<sec id="d29e2258">
<title>About this supplement</title>
<p>This article has been published as part of BMC Bioinformatics, Volume 20 Supplement 9, 2019: Italian Society of Bioinformatics (BITS): Annual Meeting 2018. The full contents of the supplement are available at
<ext-link ext-link-type="uri" xlink:href="https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-9">https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-9</ext-link>
</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>All authors contributed to the design of the approach and to the analysis of the results. JQ implemented MetaCon software and performed the experiments. JQ and MC conceived the study and drafted the manuscript. All authors have read and approved the manuscript for publication.</p>
</notes>
<notes>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</notes>
<notes>
<title>Consent for publication</title>
<p>Not applicable.</p>
</notes>
<notes notes-type="COI-statement">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</notes>
<notes>
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Staley</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Konopka</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats</article-title>
<source>Ann Rev Microbiol</source>
<year>1985</year>
<volume>39</volume>
<issue>1</issue>
<fpage>321</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.mi.39.100185.001541</pub-id>
<pub-id pub-id-type="pmid">3904603</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Handelsman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rondon</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Brady</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Clardy</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Goodman</surname>
<given-names>RM</given-names>
</name>
</person-group>
<article-title>Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products</article-title>
<source>Chem Biol</source>
<year>1998</year>
<volume>5</volume>
<issue>10</issue>
<fpage>245</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1016/S1074-5521(98)90108-9</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Felczykowska</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bloch</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Nejman-Faleńczyk</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Barańska</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Metagenomic approach in the investigation of new bioactive compounds in the marine environment</article-title>
<source>Acta Biochim Pol</source>
<year>2012</year>
<volume>59</volume>
<issue>4</issue>
<fpage>501</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.18388/abp.2012_2084</pub-id>
<pub-id pub-id-type="pmid">23251909</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mande</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Mohammed</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Ghosh</surname>
<given-names>TS</given-names>
</name>
</person-group>
<article-title>Classification of metagenomic sequences: methods and challenges</article-title>
<source>Brief Bioinforma</source>
<year>2012</year>
<volume>13</volume>
<issue>6</issue>
<fpage>669</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbs054</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alneberg</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bjarnason</surname>
<given-names>BS</given-names>
</name>
<name>
<surname>de Bruijn</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Schirmer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Quick</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ijaz</surname>
<given-names>UZ</given-names>
</name>
<name>
<surname>Lahti</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Loman</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Andersson</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Quince</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Binning metagenomic contigs by coverage and composition</article-title>
<source>Nat Methods</source>
<year>2014</year>
<volume>11</volume>
<fpage>1144</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3103</pub-id>
<pub-id pub-id-type="pmid">25218180</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bowers</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Clum</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tice</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Lim</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ciobanu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ngan</surname>
<given-names>CY</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>J-F</given-names>
</name>
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Woyke</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Impact of library preparation protocols and template quantity on the metagenomic reconstruction of a mock microbial community</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>856</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-2063-6</pub-id>
<pub-id pub-id-type="pmid">26496746</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hofmann</surname>
<given-names>P</given-names>
</name>
<name>
<surname>McHardy</surname>
<given-names>AC</given-names>
</name>
</person-group>
<article-title>Critical assessment of metagenome interpretation—a benchmark of metagenomics software</article-title>
<source>Nat Methods</source>
<year>2017</year>
<volume>14</volume>
<fpage>1063</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.4458</pub-id>
<pub-id pub-id-type="pmid">28967888</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res</source>
<year>2007</year>
<volume>17</volume>
<issue>3</issue>
<fpage>377</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
<source>Genome Biol</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>R46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wanamaker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-1419-2</pub-id>
<pub-id pub-id-type="pmid">25553907</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Qian</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Marchiori</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Peixoto</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Silveira</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ali</surname>
<given-names>HH</given-names>
</name>
<name>
<surname>Maciel</surname>
<given-names>C</given-names>
</name>
<name>
<surname>van den Broek</surname>
<given-names>EL</given-names>
</name>
</person-group>
<article-title>Fast and sensitive classification of short metagenomic reads with skraken</article-title>
<source>Biomedical Engineering Systems and Technologies</source>
<year>2018</year>
<publisher-loc>Cham</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Waldron</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ballarini</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jousson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic microbial community profiling using unique clade-specific marker genes</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>8</issue>
<fpage>811</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2066</pub-id>
<pub-id pub-id-type="pmid">22688413</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes</article-title>
<source>PLoS Biol</source>
<year>2007</year>
<volume>5</volume>
<issue>3</issue>
<fpage>e82</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pbio.0050082</pub-id>
<pub-id pub-id-type="pmid">17355177</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lindgreen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Adair</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Gardner</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>An evaluation of the accuracy and speed of metagenome analysis tools</article-title>
<source>Sci Rep</source>
<year>2016</year>
<volume>6</volume>
<fpage>19233</fpage>
<pub-id pub-id-type="doi">10.1038/srep19233</pub-id>
<pub-id pub-id-type="pmid">26778510</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Girotto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Higher recall in metagenomic sequence classification exploiting overlapping reads</article-title>
<source>BMC Genomics</source>
<year>2017</year>
<volume>18</volume>
<issue>10</issue>
<fpage>917</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-017-4273-6</pub-id>
<pub-id pub-id-type="pmid">29244002</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<mixed-citation publication-type="other">Kislyuk A. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinformatics. 2009; 10. 10.1186/1471-2105-10-316.</mixed-citation>
</ref>
<ref id="CR17">
<label>17</label>
<mixed-citation publication-type="other">Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated markov models. BMC Bioinformatics. 2010; 11. 10.1186/1471-2105-11-544.</mixed-citation>
</ref>
<ref id="CR18">
<label>18</label>
<mixed-citation publication-type="other">Strous M. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front Microbiol. 2012; 3. 10.3389/fmicb.2012.00410.</mixed-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
</person-group>
<article-title>A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1489</fpage>
<lpage>95</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr186</pub-id>
<pub-id pub-id-type="pmid">21493653</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Y-W</given-names>
</name>
<name>
<surname>Simmons</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Singer</surname>
<given-names>SW</given-names>
</name>
</person-group>
<article-title>Maxbin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>4</issue>
<fpage>605</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv638</pub-id>
<pub-id pub-id-type="pmid">26515820</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Imelfort</surname>
<given-names>Michael</given-names>
</name>
<name>
<surname>Parks</surname>
<given-names>Donovan</given-names>
</name>
<name>
<surname>Woodcroft</surname>
<given-names>Ben J.</given-names>
</name>
<name>
<surname>Dennis</surname>
<given-names>Paul</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>Philip</given-names>
</name>
<name>
<surname>Tyson</surname>
<given-names>Gene W.</given-names>
</name>
</person-group>
<article-title>GroopM: an automated tool for the recovery of population genomes from related metagenomes</article-title>
<source>PeerJ</source>
<year>2014</year>
<volume>2</volume>
<fpage>e603</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.603</pub-id>
<pub-id pub-id-type="pmid">25289188</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>DD</given-names>
</name>
<name>
<surname>Froula</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Egan</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Metabat, an efficient tool for accurately reconstructing single genomes from complex microbial communities</article-title>
<source>PeerJ</source>
<year>2015</year>
<volume>3</volume>
<fpage>e1165</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.1165</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kantorovitz</surname>
<given-names>Miriam R.</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>Gene E.</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>Saurabh</given-names>
</name>
</person-group>
<article-title>A statistical method for alignment-free comparison of regulatory sequences</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>13</issue>
<fpage>i249</fpage>
<lpage>i255</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm211</pub-id>
<pub-id pub-id-type="pmid">17646303</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sims</surname>
<given-names>Gregory E.</given-names>
</name>
<name>
<surname>Jun</surname>
<given-names>Se-Ran</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Guohong A.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>Sung-Hou</given-names>
</name>
</person-group>
<article-title>Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions</article-title>
<source>Proceedings of the National Academy of Sciences</source>
<year>2009</year>
<volume>106</volume>
<issue>8</issue>
<fpage>2677</fpage>
<lpage>2682</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0813249106</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonello</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinformatics</source>
<year>2014</year>
<volume>11</volume>
<issue>3</issue>
<fpage>500</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2013.2297924</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Beyond fixed-resolution alignment-free measures for mammalian enhancers sequence comparison</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinforma</source>
<year>2014</year>
<volume>11</volume>
<issue>4</issue>
<fpage>628</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2014.2306830</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Leoni</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Schimd</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Clustering of reads with alignment-free measures and quality values</article-title>
<source>Algorithms Mol Biol</source>
<year>2015</year>
<volume>10</volume>
<issue>1</issue>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/s13015-014-0029-x</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Antonello</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>On the comparison of regulatory sequences with multiple resolution entropic profiles</article-title>
<source>BMC Bioinformatics</source>
<year>2016</year>
<volume>17</volume>
<issue>1</issue>
<fpage>130</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-016-0980-2</pub-id>
<pub-id pub-id-type="pmid">26987840</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lippert</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Distributional regimes for the number of k-word matches between two random sequences</article-title>
<source>PNAS</source>
<year>2002</year>
<volume>99</volume>
<issue>22</issue>
<fpage>13980</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.202468099</pub-id>
<pub-id pub-id-type="pmid">12374863</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chung</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison (i): statistics and power</article-title>
<source>J Comput Biol</source>
<year>2009</year>
<volume>16</volume>
<issue>12</issue>
<fpage>1615</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2009.0198</pub-id>
<pub-id pub-id-type="pmid">20001252</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Régnier</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A unified approach to word occurrence probabilities</article-title>
<source>Discret Appl Math</source>
<year>2000</year>
<volume>104</volume>
<issue>1</issue>
<fpage>259</fpage>
<lpage>80</lpage>
<pub-id pub-id-type="doi">10.1016/S0166-218X(00)00195-5</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing</article-title>
<source>Brief Bioinforma</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>343</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt067</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kaufman</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Rousseeuw</surname>
<given-names>P</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Dodge</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Clustering by means of medoids</article-title>
<source>Data Analysis based on the L1-Norm and Related Methods</source>
<year>1987</year>
<publisher-loc>North-Holland</publisher-loc>
<publisher-name>Elsevier</publisher-name>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<mixed-citation publication-type="other">Chen M. Super fast and terse kmeans clustering. 2017.
<ext-link ext-link-type="uri" xlink:href="https://nl.mathworks.com/matlabcentral/fileexchange/24616-kmeans-clustering">https://nl.mathworks.com/matlabcentral/fileexchange/24616-kmeans-clustering</ext-link>
.</mixed-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Consortium</surname>
<given-names>HMP</given-names>
</name>
</person-group>
<article-title>Structure, function and diversity of the healthy human microbiome</article-title>
<source>Nature</source>
<year>2012</year>
<volume>486</volume>
<issue>7402</issue>
<fpage>207</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="doi">10.1038/nature11234</pub-id>
<pub-id pub-id-type="pmid">22699609</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boisvert</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Godzaridis</surname>
<given-names>É</given-names>
</name>
<name>
<surname>Laviolette</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Corbeil</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Ray meta: scalable de novo metagenome assembly and profiling</article-title>
<source>Genome Biol</source>
<year>2012</year>
<volume>13</volume>
<issue>12</issue>
<fpage>R122</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2012-13-12-r122</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sharon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Morowitz</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Costello</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Relman</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Banfield</surname>
<given-names>JF</given-names>
</name>
</person-group>
<article-title>Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization</article-title>
<source>Genome Res</source>
<year>2013</year>
<volume>23</volume>
<issue>1</issue>
<fpage>111</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1101/gr.142315.112</pub-id>
<pub-id pub-id-type="pmid">22936250</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<mixed-citation publication-type="other">Ijaz
<italic>et al</italic>
A. Taxaassign v4.0. 2013.
<ext-link ext-link-type="uri" xlink:href="http://github.com/umerijaz/taxaassign">http://github.com/umerijaz/taxaassign</ext-link>
.</mixed-citation>
</ref>
<ref id="CR39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinh</surname>
<given-names>LV</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>TV</given-names>
</name>
<name>
<surname>Binh</surname>
<given-names>LT</given-names>
</name>
<name>
<surname>Hoai</surname>
<given-names>TV</given-names>
</name>
</person-group>
<article-title>A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads</article-title>
<source>Algorithms Mol Biol</source>
<year>2015</year>
<volume>10</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="doi">10.1186/s13015-014-0028-y</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000286  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000286  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021