MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 0003020 (Pmc/Corpus); previous: 0003019; next: 0003021



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Estimating the total genome length of a metagenomic sample using k-mers</title>
<author>
<name sortKey="Hua, Kui" sort="Hua, Kui" uniqKey="Hua K" first="Kui" last="Hua">Kui Hua</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0369 313X</institution-id>
<institution-id institution-id-type="GRID">grid.419897.a</institution-id>
<institution>MOE Key Laboratory of Bioinformatics Division and Center for Synthetic &amp; System Biology, BNRIST,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>Department of Automation, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Xuegong" sort="Zhang, Xuegong" uniqKey="Zhang X" first="Xuegong" last="Zhang">Xuegong Zhang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0369 313X</institution-id>
<institution-id institution-id-type="GRID">grid.419897.a</institution-id>
<institution>MOE Key Laboratory of Bioinformatics Division and Center for Synthetic &amp; System Biology, BNRIST,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>Department of Automation, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>School of Life Sciences, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30967110</idno>
<idno type="pmc">6456951</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951</idno>
<idno type="RBID">PMC:6456951</idno>
<idno type="doi">10.1186/s12864-019-5467-x</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000302</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000302</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Estimating the total genome length of a metagenomic sample using k-mers</title>
<author>
<name sortKey="Hua, Kui" sort="Hua, Kui" uniqKey="Hua K" first="Kui" last="Hua">Kui Hua</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0369 313X</institution-id>
<institution-id institution-id-type="GRID">grid.419897.a</institution-id>
<institution>MOE Key Laboratory of Bioinformatics Division and Center for Synthetic &amp; System Biology, BNRIST,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>Department of Automation, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Xuegong" sort="Zhang, Xuegong" uniqKey="Zhang X" first="Xuegong" last="Zhang">Xuegong Zhang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0369 313X</institution-id>
<institution-id institution-id-type="GRID">grid.419897.a</institution-id>
<institution>MOE Key Laboratory of Bioinformatics Division and Center for Synthetic &amp; System Biology, BNRIST,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>Department of Automation, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>School of Life Sciences, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Genomics</title>
<idno type="eISSN">1471-2164</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.</p>
</sec>
<sec>
<title>Results</title>
<p>As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (
<italic>KRI</italic>
) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Gordon, Ji" uniqKey="Gordon J">JI Gordon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Falony, G" uniqKey="Falony G">G Falony</name>
</author>
<author>
<name sortKey="Wijmenga, C" uniqKey="Wijmenga C">C Wijmenga</name>
</author>
<author>
<name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhernakova, A" uniqKey="Zhernakova A">A Zhernakova</name>
</author>
<author>
<name sortKey="Wijmenga, C" uniqKey="Wijmenga C">C Wijmenga</name>
</author>
<author>
<name sortKey="Fu, J" uniqKey="Fu J">J Fu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Liu, S" uniqKey="Liu S">S Liu</name>
</author>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
<author>
<name sortKey="Chen, T" uniqKey="Chen T">T Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodriguez, Rl" uniqKey="Rodriguez R">RL Rodriguez</name>
</author>
<author>
<name sortKey="Konstantinidis, Kt" uniqKey="Konstantinidis K">KT Konstantinidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lander, Es" uniqKey="Lander E">ES Lander</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hooper, Sd" uniqKey="Hooper S">SD Hooper</name>
</author>
<author>
<name sortKey="Dalevi, D" uniqKey="Dalevi D">D Dalevi</name>
</author>
<author>
<name sortKey="Pati, A" uniqKey="Pati A">A Pati</name>
</author>
<author>
<name sortKey="Mavromatis, K" uniqKey="Mavromatis K">K Mavromatis</name>
</author>
<author>
<name sortKey="Ivanova, Nn" uniqKey="Ivanova N">NN Ivanova</name>
</author>
<author>
<name sortKey="Kyrpides, Nc" uniqKey="Kyrpides N">NC Kyrpides</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Daley, T" uniqKey="Daley T">T Daley</name>
</author>
<author>
<name sortKey="Smith, Ad" uniqKey="Smith A">AD Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rodriguez, Rl" uniqKey="Rodriguez R">RL Rodriguez</name>
</author>
<author>
<name sortKey="Konstantinidis, Kt" uniqKey="Konstantinidis K">KT Konstantinidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tamames, J" uniqKey="Tamames J">J Tamames</name>
</author>
<author>
<name sortKey="De La Pena, S" uniqKey="De La Pena S">S de la Pena</name>
</author>
<author>
<name sortKey="De Lorenzo, V" uniqKey="De Lorenzo V">V de Lorenzo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wendl, Mc" uniqKey="Wendl M">MC Wendl</name>
</author>
<author>
<name sortKey="Kota, K" uniqKey="Kota K">K Kota</name>
</author>
<author>
<name sortKey="Weinstock, Gm" uniqKey="Weinstock G">GM Weinstock</name>
</author>
<author>
<name sortKey="Mitreva, M" uniqKey="Mitreva M">M Mitreva</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
<author>
<name sortKey="Waldron, L" uniqKey="Waldron L">L Waldron</name>
</author>
<author>
<name sortKey="Ballarini, A" uniqKey="Ballarini A">A Ballarini</name>
</author>
<author>
<name sortKey="Narasimhan, V" uniqKey="Narasimhan V">V Narasimhan</name>
</author>
<author>
<name sortKey="Jousson, O" uniqKey="Jousson O">O Jousson</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oh, J" uniqKey="Oh J">J Oh</name>
</author>
<author>
<name sortKey="Byrd, Al" uniqKey="Byrd A">AL Byrd</name>
</author>
<author>
<name sortKey="Deming, C" uniqKey="Deming C">C Deming</name>
</author>
<author>
<name sortKey="Conlan, S" uniqKey="Conlan S">S Conlan</name>
</author>
<author>
<name sortKey="Program, Ncs" uniqKey="Program N">NCS Program</name>
</author>
<author>
<name sortKey="Kong, Hh" uniqKey="Kong H">HH Kong</name>
</author>
<author>
<name sortKey="Segre, Ja" uniqKey="Segre J">JA Segre</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barbour, Ad" uniqKey="Barbour A">AD Barbour</name>
</author>
<author>
<name sortKey="Chen, Lhy" uniqKey="Chen L">LHY Chen</name>
</author>
<author>
<name sortKey="Loh, Wl" uniqKey="Loh W">WL Loh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Daley, T" uniqKey="Daley T">T Daley</name>
</author>
<author>
<name sortKey="Smith, Ad" uniqKey="Smith A">AD Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Golub, Gh" uniqKey="Golub G">GH Golub</name>
</author>
<author>
<name sortKey="Welsch, Jh" uniqKey="Welsch J">JH Welsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Truong, Dt" uniqKey="Truong D">DT Truong</name>
</author>
<author>
<name sortKey="Franzosa, Ea" uniqKey="Franzosa E">EA Franzosa</name>
</author>
<author>
<name sortKey="Tickle, Tl" uniqKey="Tickle T">TL Tickle</name>
</author>
<author>
<name sortKey="Scholz, M" uniqKey="Scholz M">M Scholz</name>
</author>
<author>
<name sortKey="Weingart, G" uniqKey="Weingart G">G Weingart</name>
</author>
<author>
<name sortKey="Pasolli, E" uniqKey="Pasolli E">E Pasolli</name>
</author>
<author>
<name sortKey="Tett, A" uniqKey="Tett A">A Tett</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freitas, Tak" uniqKey="Freitas T">TAK Freitas</name>
</author>
<author>
<name sortKey="Li, P E" uniqKey="Li P">P-E Li</name>
</author>
<author>
<name sortKey="Scholz, Mb" uniqKey="Scholz M">MB Scholz</name>
</author>
<author>
<name sortKey="Chain, Ps" uniqKey="Chain P">PS Chain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marinier, E" uniqKey="Marinier E">E Marinier</name>
</author>
<author>
<name sortKey="Brown, Dg" uniqKey="Brown D">DG Brown</name>
</author>
<author>
<name sortKey="Mcconkey, Bj" uniqKey="Mcconkey B">BJ McConkey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pruitt, Kd" uniqKey="Pruitt K">KD Pruitt</name>
</author>
<author>
<name sortKey="Tatusova, T" uniqKey="Tatusova T">T Tatusova</name>
</author>
<author>
<name sortKey="Maglott, Dr" uniqKey="Maglott D">DR Maglott</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mavromatis, K" uniqKey="Mavromatis K">K Mavromatis</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Kyrpides, Nc" uniqKey="Kyrpides N">NC Kyrpides</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, S" uniqKey="Liu S">S Liu</name>
</author>
<author>
<name sortKey="Hua, K" uniqKey="Hua K">K Hua</name>
</author>
<author>
<name sortKey="Chen, S" uniqKey="Chen S">S Chen</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Turnbaugh, Pj" uniqKey="Turnbaugh P">PJ Turnbaugh</name>
</author>
<author>
<name sortKey="Ley, Re" uniqKey="Ley R">RE Ley</name>
</author>
<author>
<name sortKey="Hamady, M" uniqKey="Hamady M">M Hamady</name>
</author>
<author>
<name sortKey="Fraser Liggett, Cm" uniqKey="Fraser Liggett C">CM Fraser-Liggett</name>
</author>
<author>
<name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
<author>
<name sortKey="Gordon, Ji" uniqKey="Gordon J">JI Gordon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qin, J" uniqKey="Qin J">J Qin</name>
</author>
<author>
<name sortKey="Kristiansen, K" uniqKey="Kristiansen K">K Kristiansen</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Genomics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Genomics</journal-id>
<journal-title-group>
<journal-title>BMC Genomics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2164</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30967110</article-id>
<article-id pub-id-type="pmc">6456951</article-id>
<article-id pub-id-type="publisher-id">5467</article-id>
<article-id pub-id-type="doi">10.1186/s12864-019-5467-x</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Estimating the total genome length of a metagenomic sample using k-mers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Hua</surname>
<given-names>Kui</given-names>
</name>
<address>
<email>huak14@mails.tsinghua.edu.cn</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhang</surname>
<given-names>Xuegong</given-names>
</name>
<address>
<email>zhangxg@tsinghua.edu.cn</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0369 313X</institution-id>
<institution-id institution-id-type="GRID">grid.419897.a</institution-id>
<institution>MOE Key Laboratory of Bioinformatics Division and Center for Synthetic &amp; System Biology, BNRIST,</institution>
</institution-wrap>
Beijing, 100084 China</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>Department of Automation, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0662 3178</institution-id>
<institution-id institution-id-type="GRID">grid.12527.33</institution-id>
<institution>School of Life Sciences, Tsinghua University,</institution>
</institution-wrap>
Beijing, 100084 China</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>4</day>
<month>4</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>4</day>
<month>4</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>20</volume>
<issue>Suppl 2</issue>
<issue-sponsor>Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</issue-sponsor>
<elocation-id>183</elocation-id>
<permissions>
<copyright-statement>© The Author(s) 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.</p>
</sec>
<sec>
<title>Results</title>
<p>As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (
<italic>KRI</italic>
) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Metagenomics</kwd>
<kwd>Sequencing coverage</kwd>
<kwd>Distinct k-mers</kwd>
<kwd>Genome length</kwd>
</kwd-group>
<conference xlink:href="http://glab.hzau.edu.cn/APBC2019/">
<conf-name>The 17th Asia Pacific Bioinformatics Conference (APBC 2019)</conf-name>
<conf-acronym>APBC 2019</conf-acronym>
<conf-loc>Wuhan, China</conf-loc>
<conf-date>14-16 January 2019</conf-date>
</conference>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>It is now widely known that microbiomes, the ecological communities of microbes living at a particular site of the human host such as the gut, can play important roles in human health [
<xref ref-type="bibr" rid="CR1">1</xref>
–
<xref ref-type="bibr" rid="CR5">5</xref>
]. Metagenomic sequencing is a powerful technology for studying the microbiome by sequencing DNAs from all the genomes of its component microbes [
<xref ref-type="bibr" rid="CR5">5</xref>
]. Since it is impossible to capture every component of a microbiome, a ‘metagenomic sample’ is actually a subset of the target metagenome captured by the sequencing process, much like a sample drawn from a population in statistics [
<xref ref-type="bibr" rid="CR6">6</xref>
]. The basic task of a metagenomic study is to read out the underlying information about the microbiome from the metagenomic sample.</p>
<p>For any genomic sequencing study, a fundamental property we need to consider is the sequencing coverage, which is the fraction of the genomic material that has been captured and sequenced. This, however, has been largely ignored in metagenomic studies [
<xref ref-type="bibr" rid="CR6">6</xref>
]. The level of coverage of a metagenomic sample is of key importance for recovering the information about the microbiome. Variations caused by coverage differences between metagenomic samples can be wrongly attributed to biological reasons, resulting in misleading conclusions [
<xref ref-type="bibr" rid="CR6">6</xref>
].</p>
<p>The question of estimating the coverage of a sequencing sample has attracted researchers’ attention since the beginning of the Human Genome Project. In 1988, Eric S. Lander and Michael S. Waterman introduced the well-known Lander-Waterman theory to show how well a genome can be recovered under a given sequencing strategy [
<xref ref-type="bibr" rid="CR7">7</xref>
]. It played a key role in guiding the design and completion of the Human Genome Project. Lander-Waterman theory was designed specifically for single-genome sequencing projects. It is no longer suitable for most metagenomic data, since the relative abundances of component genomes in a microbiome are very uneven and the sequencing procedure therefore violates the uniform distribution assumption [
<xref ref-type="bibr" rid="CR8">8</xref>
]. This is also true for other types of sequencing projects like RNA-sequencing or ChIP-seq where distributions of components to be sequenced are uneven. Methods were therefore introduced to estimate the coverage or solve similar problems in such situations [
<xref ref-type="bibr" rid="CR8">8</xref>
–
<xref ref-type="bibr" rid="CR12">12</xref>
]. For example, Hooper et al. proposed a method to estimate the total number of genomic bins in a metagenome by assuming a certain abundance distribution of the microbial composition [
<xref ref-type="bibr" rid="CR8">8</xref>
]. Rodriguez et al. assessed the abundance-weighted coverage of a metagenomic sample by examining the redundancy among individual reads [
<xref ref-type="bibr" rid="CR10">10</xref>
]. Daley and Smith introduced an empirical Bayesian method to predict the number of previously un-sequenced molecules that would be observed if additional reads were provided [
<xref ref-type="bibr" rid="CR9">9</xref>
]. This method has proved powerful on several kinds of sequencing data, such as ChIP-seq and RNA-seq data, but its effectiveness on metagenomic data has not been studied.</p>
<p>For the genomic sequences that have been captured in a metagenomic sample, the basic information we want is which types of microbes are present and at what abundances. This is referred to as taxonomy profiling. A straightforward way of taxonomy profiling is to map sequencing reads to reference genomes in known databases. However, known microbial genomes represent only a small proportion of existing microbes. Even for well-studied communities such as the human gut, typically around 30%–60% of the sequencing reads in a metagenomic sample cannot be mapped to any known microbial genome [
<xref ref-type="bibr" rid="CR13">13</xref>
]. Furthermore, it has been observed that the fraction of unmapped reads can vary dramatically across samples in the same study, ranging, surprisingly, from 2% to 96% [
<xref ref-type="bibr" rid="CR14">14</xref>
]. This type of between-sample variation is lost when relative abundances are calculated from mapped reads only. Ignoring this loss of information can mislead downstream analyses [
<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
<p>Mainly because of incomplete coverage and the existence of unmapped reads, the genomes that can be profiled from a metagenomic sample are only a part of all the genomes that exist in the microbiome. It is therefore desirable to estimate the genomes that have been missed. Even if it is not possible to accurately estimate the number of missed genomes and their relative abundances, an educated guess about any property of the missed genomes can provide useful information for comparing samples based on known genomes. In this paper, we study the problem of estimating the total length of all distinct genomes in a metagenomic sample. If we can estimate this with reasonable accuracy, we will learn a lot about the missed genomes by subtracting the known and mapped genomes from the total. This is the same question as estimating the actual coverage of the unknown target microbiome by the observed sequencing reads in the metagenomic sample. During the preparation of this manuscript, a similar question was studied in [
<xref ref-type="bibr" rid="CR15">15</xref>
], but that method requires both long reads and short reads. For the most common case, where only short reads are available, we found that this question can be answered by solving the related question of estimating the number of distinct k-mers that would be observed in the metagenome at infinite sequencing depth. A statistical model is introduced to predict the number of distinct k-mers in a metagenome that have not been included in the observed data. We also define a k-mer redundancy index (
<italic>KRI</italic>
) that helps to estimate the total genome length from total distinct k-mer count. Since the underlying truth is unknown in any real metagenomic data, we simulated a set of synthetic metagenomic datasets for different situations of microbial composition. Experiments on these data showed that the proposed method works well.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>Problem statements</title>
<p>The problem we study is to estimate the total length of distinct genomes in a microbiome based on the metagenomic sequencing data. A more precise statement of this problem in practice depends on the criteria by which two genomes are identified as distinct from each other. This is a complicated taxonomic question, considering the wide existence of strains and sub-strains within each microbial species. To focus on the key mathematical problem behind the question, we simply assume that genomes from the same species are identical while genomes from different species are distinct. We discuss this further in the “
<xref rid="Sec7" ref-type="sec">Estimating
<italic>KRI</italic>
of the distinct genome set</xref>
” section.</p>
</sec>
<sec id="Sec4">
<title>Understanding DNA sequence as a collection of k-mers</title>
<p>A DNA sequence can be viewed as a collection of k-mers by breaking the sequence into nucleotide substrings of length k, as illustrated in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
a. From the k-mer perspective, we define total k-mer count (
<italic>TKC</italic>
), distinct k-mer count (
<italic>DKC</italic>
) and k-mer redundancy index (
<italic>KRI</italic>
) as three properties of a sequence.
<italic>TKC</italic>
is the number of all k-mers obtained when breaking a sequence into k-mers.
<italic>DKC</italic>
is the number of distinct k-mers, i.e., the number of k-mers remaining after all duplicate k-mers are removed.
<italic>KRI</italic>
is defined as the ratio of
<italic>TKC</italic>
to
<italic>DKC</italic>
, which reflects the degree of repetition of k-mers in the sequence. The values of these three properties depend on the target sequence and the selection of k-mer size (
<italic>k</italic>
). For a given
<italic>k</italic>
, any of the three properties can be obtained if the other two are provided. For example,
<italic>TKC</italic>
=
<italic>DKC</italic>
×
<italic>KRI</italic>
, which means
<italic>TKC</italic>
can be obtained if we know
<italic>DKC</italic>
and
<italic>KRI</italic>
of a k-mer collection. Obviously, for a sequence of length
<italic>L</italic>
,
<italic>TKC</italic>
=
<italic>L</italic>
−
<italic>k</italic>
+1, indicating that
<italic>TKC</italic>
can be roughly taken as the sequence length if
<italic>L</italic>
≫
<italic>k</italic>
, which is satisfied when studying genomes using small k-mers. These simple mathematical relations form the basic idea of our work.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Overview of the proposed method.
<bold>a</bold>
An illustration of understanding DNA sequence as a collection of k-mers. In this simple case, sequence length
<italic>L</italic>
=12,
<italic>k</italic>
=6 for the k-mer counting,
<italic>TKC</italic>
=
<italic>L</italic>
−
<italic>k</italic>
+1=7,
<italic>DKC</italic>
=5,
<italic>KRI</italic>
=
<italic>TKC</italic>
/
<italic>DKC</italic>
=1.2.
<bold>b</bold>
Relationships between metagenome, metagenomic sample and the set of distinct genomes in the metagenome.
<bold>c</bold>
Workflow of the proposed method</p>
</caption>
<graphic xlink:href="12864_2019_5467_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>Similarly, a set of sequences can also be treated as a collection of k-mers by breaking every single sequence into k-mers. Therefore, a metagenomic sample, the metagenome and the set of distinct genomes in a metagenome can all be viewed as a collection of k-mers, respectively, as illustrated in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b.</p>
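<p>The three quantities can be computed directly for a single sequence. The following Python sketch is illustrative only: the function name kmer_stats and the 12-base example sequence are assumptions introduced here (the sequence is not the one drawn in Fig. 1a), and strand orientation is ignored.</p>
<preformat><![CDATA[
```python
def kmer_stats(seq, k):
    """Return (TKC, DKC, KRI) for one sequence and k-mer size k."""
    # Break the sequence into all overlapping substrings of length k.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    tkc = len(kmers)        # total k-mer count, equal to L - k + 1
    dkc = len(set(kmers))   # distinct k-mer count
    return tkc, dkc, tkc / dkc  # KRI = TKC / DKC


# Hypothetical, highly repetitive 12-base sequence with k = 6.
tkc, dkc, kri = kmer_stats("ACGTACGTACGT", 6)
print(tkc, dkc, kri)
```
]]></preformat>
<p>For this toy sequence, TKC = 7, DKC = 4 and KRI = 1.75, and TKC = DKC × KRI holds by construction.</p>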
</sec>
<sec id="Sec5">
<title>Overview of our solution</title>
<p>From the k-mer perspective, our aim of estimating total genome length of all distinct genomes in a metagenome is equivalent to estimating
<italic>TKC</italic>
of the set of distinct genomes (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b). Since it is impossible to count
<italic>TKC</italic>
of the true metagenome from the metagenomic sample due to finite sequencing coverage and unknown genome composition, we predict
<italic>TKC</italic>
of the distinct genome set by estimating its
<italic>DKC</italic>
and
<italic>KRI</italic>
separately (Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
c). A metagenome and the corresponding set of distinct genomes of all its components differ only in genome abundances, so they share the same distinct k-mers and have equal
<italic>DKC</italic>
s. We estimate
<italic>DKC</italic>
of the metagenome from the observed metagenomic data by modeling the sequencing event as a Poisson sampling procedure.
<italic>KRI</italic>
of the distinct genome set can be estimated based on known genomes detected in the metagenomic sample. Finally, the total genome length, which is roughly equal to
<italic>TKC</italic>
, is obtained simply by taking the product of
<italic>KRI</italic>
and
<italic>DKC</italic>
.</p>
</sec>
<sec id="Sec6">
<title>Predicting
<italic>DKC</italic>
of the metagenome</title>
<p>A metagenomic sample can be viewed as a subset of the metagenome obtained by random sampling, as illustrated in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
b.
<italic>DKC</italic>
of a metagenomic sample can be readily obtained by counting k-mers in the sequences, either from the original sequencing reads or from the assembled scaffolds. We need to estimate the number of k-mers in the metagenome that have not been covered in the metagenomic sample. The frequency that a given k-mer
<italic>i</italic>
is sequenced, denoted as
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
, can be modeled as a Poisson random variable with an unknown rate parameter
<italic>λ</italic>
<sub>
<italic>i</italic>
</sub>
. The probability that k-mer
<italic>i</italic>
will not be sequenced is
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$e^{-\lambda _{i}}\phantom {\dot {i}\!}$\end{document}</tex-math>
<mml:math id="M2">
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:math>
<inline-graphic xlink:href="12864_2019_5467_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
. We call such k-mers uncaptured k-mers. Although the frequencies of overlapping k-mers are dependent, such limited dependence can be well approximated by assuming independence [
<xref ref-type="bibr" rid="CR16">16</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
]. Therefore, we further assume that
<italic>λ</italic>
<sub>
<italic>i</italic>
</sub>
are independently and identically distributed according to some unknown distribution
<italic>μ</italic>
(
<italic>λ</italic>
), so that the expected number of uncaptured k-mers is
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} N\int \limits_{0}^{\infty} e^{-\lambda} \mathrm d\mu(\lambda) \end{array} $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mi>N</mml:mi>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>N</italic>
is the
<italic>DKC</italic>
of the metagenome. Since both
<italic>N</italic>
and
<italic>μ</italic>
(
<italic>λ</italic>
) are unknown, we cannot calculate the value of (1) directly. Fortunately, the frequencies of captured k-mers in the metagenomic sample also contain information about
<italic>N</italic>
and
<italic>μ</italic>
(
<italic>λ</italic>
), which helps us to estimate the value of (1). Let
<italic>n</italic>
<sub>
<italic>j</italic>
</sub>
denote the number of k-mers that appear
<italic>j</italic>
times in the metagenomic sample. The expectation of
<italic>n</italic>
<sub>
<italic>j</italic>
</sub>
can be written as
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} E(n_{j}) = N\int \limits_{0}^{\infty} e^{-\lambda}\lambda^{j}/j! \mathrm d\mu(\lambda) \end{array} $$ \end{document}</tex-math>
<mml:math id="M6">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mi>E</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>N</mml:mi>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>!</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
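<p>Equation (2) can be checked numerically with a small simulation. The sketch below is an illustration under stated assumptions, not part of the published pipeline: the gamma distribution chosen for the rates, the number of k-mers and the helper names expected_counts and simulate_counts are all hypothetical. It draws a rate for each k-mer, samples its count from a Poisson distribution, and compares the simulated histogram with the Monte Carlo value of the integral in (2).</p>
<preformat><![CDATA[
```python
import math

import numpy as np


def expected_counts(lam, max_j):
    """Monte Carlo form of Eq. (2): E(n_j) = N * mean(exp(-lam) * lam^j / j!)."""
    n = len(lam)
    return np.array([
        n * np.mean(np.exp(-lam) * lam ** j / math.factorial(j))
        for j in range(max_j + 1)
    ])


def simulate_counts(lam, max_j, rng):
    """Draw x_i ~ Poisson(lam_i) and tabulate n_0, ..., n_max_j."""
    x = rng.poisson(lam)
    return np.bincount(x, minlength=max_j + 1)[:max_j + 1]


rng = np.random.default_rng(0)
# Hypothetical choice of mu(lambda): gamma-distributed rates for 200,000 k-mers.
lam = rng.gamma(shape=2.0, scale=1.5, size=200_000)
observed = simulate_counts(lam, 4, rng)
expected = expected_counts(lam, 4)
print(observed)
print(np.round(expected))
```
]]></preformat>
<p>The first entry of each array is the number of uncaptured k-mers (count zero), which is exactly the quantity in (1) that cannot be observed in real data.</p>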
<p>If we take the observation
<italic>n</italic>
<sub>
<italic>j</italic>
</sub>
as its expectation
<italic>E</italic>
(
<italic>n</italic>
<sub>
<italic>j</italic>
</sub>
), the mathematical problem of estimating the number of uncaptured k-mers can be formulated as:</p>
<p>
<bold>Given observations</bold>
<italic>n</italic>
<sub>1</sub>
,
<italic>n</italic>
<sub>2</sub>
,
<italic>n</italic>
<sub>3</sub>
,…,
<italic>n</italic>
<sub>
<italic>M</italic>
</sub>
,
<bold>which follow the formula</bold>
<disp-formula id="Equa">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ {n_{j}= N \int \limits_{0}^{\infty} e^{-\lambda}\lambda^{j}/j! \mathrm{d} \mu(\lambda)} $$ \end{document}</tex-math>
<mml:math id="M8">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>N</mml:mi>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>!</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>
<bold>where</bold>
<italic>N</italic>
<bold>and</bold>
<italic>μ</italic>
(
<italic>λ</italic>
)
<bold>are unknown. Find the value of</bold>
<disp-formula id="Equb">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${N\int \limits_{0}^{\infty} e^{-\lambda}\mathrm d\mu(\lambda)} $$ \end{document}</tex-math>
<mml:math id="M10">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>To solve this mathematical problem, let
<italic>ω</italic>
(
<italic>λ</italic>
)=
<italic>N</italic>
<italic>λ</italic>
<italic>e</italic>
<sup>−
<italic>λ</italic>
</sup>
,
<italic>m</italic>
<sub>
<italic>i</italic>
</sub>
=(
<italic>i</italic>
+1)!
<italic>n</italic>
<sub>
<italic>i</italic>
+1</sub>
. The problem can then be rewritten as</p>
<p>
<bold>
<italic>Given observations</italic>
</bold>
<italic>m</italic>
<sub>0</sub>
,
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
,…,
<italic>m</italic>
<sub>
<italic>M</italic>
−1</sub>
,
<bold>
<italic>which follow the formula</italic>
</bold>
<disp-formula id="Equc">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${m_{j}= \int \limits_{0}^{\infty} \lambda^{j}\omega(\lambda)\mathrm{d}\mu(\lambda)} $$ \end{document}</tex-math>
<mml:math id="M12">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi>ω</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>
<bold>
<italic>where</italic>
</bold>
<italic>ω</italic>
(
<italic>λ</italic>
)
<bold>
<italic>and</italic>
</bold>
<italic>μ</italic>
(
<italic>λ</italic>
)
<bold>
<italic>are unknown. Find the value of</italic>
</bold>
<disp-formula id="Equd">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $${\int \limits_{0}^{\infty} \frac{1}{\lambda} \omega(\lambda) \mathrm d\mu(\lambda)} $$ \end{document}</tex-math>
<mml:math id="M14">
<mml:mrow>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mi>ω</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
This is a special type of Gaussian quadrature problem that can be solved using the Golub-Welsch algorithm [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR18">18</xref>
]. The final estimate of (
<xref rid="Equ1" ref-type="">1</xref>
) can be written as
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} N\int \limits_{0}^{\infty} e^{-\lambda}\mathrm{d}\mu(\lambda) \approx \sum_{i=1}^{M} \frac{\alpha_{i}}{\lambda_{i}} \end{array} $$ \end{document}</tex-math>
<mml:math id="M16">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mi>N</mml:mi>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi>μ</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>≈</mml:mo>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12864_2019_5467_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
and
<italic>λ</italic>
<sub>
<italic>i</italic>
</sub>
are determined by the Golub-Welsch algorithm, taking
<italic>m</italic>
<sub>0</sub>
,
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
,…,
<italic>m</italic>
<sub>
<italic>M</italic>
−1</sub>
as the input.
<italic>DKC</italic>
of the metagenome is finally obtained by adding this estimated number of uncaptured k-mers to
<italic>DKC</italic>
of the metagenomic sample. The variability and reliability of the estimate are reflected by the confidence interval obtained with the bootstrap method.</p>
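<p>The quadrature step can be sketched in a few lines of Python. This is a minimal illustration and not the preseq implementation: it builds the Jacobi matrix from the Hankel matrices of the moments by a Cholesky factorisation and takes its eigen-decomposition, which is the core idea of the Golub-Welsch algorithm. The function names, the use of NumPy and the toy two-rate community are assumptions introduced here. Hankel moment matrices become ill-conditioned quickly, so a sketch of this kind is only suitable for a handful of nodes.</p>
<preformat><![CDATA[
```python
import math

import numpy as np


def gauss_quadrature_from_moments(moments):
    """Nodes and weights of the p-point Gauss rule for a positive measure
    whose moments m_0, ..., m_{2p-1} are given."""
    m = np.asarray(moments, dtype=float)
    p = len(m) // 2
    h0 = np.array([[m[i + j] for j in range(p)] for i in range(p)])
    h1 = np.array([[m[i + j + 1] for j in range(p)] for i in range(p)])
    low = np.linalg.cholesky(h0)            # h0 = low @ low.T
    a = np.linalg.solve(low, h1)            # low^-1 @ h1
    jacobi = np.linalg.solve(low, a.T)      # low^-1 @ h1 @ low^-T
    jacobi = (jacobi + jacobi.T) / 2.0      # remove rounding asymmetry
    nodes, vecs = np.linalg.eigh(jacobi)
    weights = m[0] * vecs[0, :] ** 2
    return nodes, weights


def estimate_uncaptured(counts):
    """counts[j - 1] = n_j, the number of k-mers observed exactly j times."""
    # m_j = (j + 1)! * n_{j+1}, as in the reformulated problem.
    moments = [math.factorial(j + 1) * counts[j] for j in range(len(counts))]
    nodes, weights = gauss_quadrature_from_moments(moments)
    return float(np.sum(weights / nodes))   # right-hand side of Eq. (3)


# Toy community (assumption): 1000 k-mers with rate 1 and 500 with rate 4,
# with n_j set to its expectation for j = 1, ..., 4.
toy_counts = [
    1000 * math.exp(-1) / math.factorial(j)
    + 500 * math.exp(-4) * 4 ** j / math.factorial(j)
    for j in range(1, 5)
]
estimate = estimate_uncaptured(toy_counts)
print(estimate)
```
]]></preformat>
<p>For this toy input the two-point rule recovers both rates exactly, so the estimate equals 1000e<sup>−1</sup> + 500e<sup>−4</sup>, about 377 uncaptured k-mers. Real histograms are noisy, which is why the bootstrap confidence interval matters.</p>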
</sec>
<sec id="Sec7">
<title>Estimating
<italic>KRI</italic>
of the distinct genome set</title>
<p>To precisely estimate
<italic>KRI</italic>
of the set of distinct genomes of a metagenome, one needs to know all different genomes in the metagenome, which is usually unachievable due to the existence of many unknown microbes. To deal with this problem, we reasoned that
<italic>KRI</italic>
of a genome set can be well estimated using only part of the genomes in it. Therefore, we can use known genomes detected in a metagenomic sample to estimate the
<italic>KRI</italic>
of the whole distinct genome set. In practice, we first apply MetaPhlAn2 [
<xref ref-type="bibr" rid="CR19">19</xref>
] and GOTTCHA [
<xref ref-type="bibr" rid="CR20">20</xref>
] to the metagenomic data to identify known species in the metagenome. For each detected species, we select one of its reference genomes from the database [
<xref ref-type="bibr" rid="CR9">9</xref>
] to form a genome set. An alternative way to form the genome set is to take the assembled scaffolds as detected genomes. We use the
<italic>KRI</italic>
of this set of detected genomes as the
<italic>KRI</italic>
of the distinct genome set.</p>
<p>The way of selecting known genomes to estimate
<italic>KRI</italic>
effectively determines the criterion for identifying distinct genomes in our work. Since we select only one genome for each detected species to estimate the
<italic>KRI</italic>
of the set of distinct genomes, the estimation is restricted to the species level, even if two strains of the same species were detected in the metagenomic sample. If we include genomes for all detected strains in the
<italic>KRI</italic>
estimation, the result will be at the strain level.</p>
</sec>
<sec id="Sec8">
<title>Implementation of the method</title>
<p>We first adopt Pollux [
<xref ref-type="bibr" rid="CR21">21</xref>
] to correct the sequencing errors in the metagenomic samples. Counting all k-mers in a metagenomic sample can be computationally expensive. We employ jellyfish2 [
<xref ref-type="bibr" rid="CR22">22</xref>
], one of the fastest k-mer counting approaches, for the k-mer counting step. We use the Golub-Welsch algorithm implemented in preseq [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
] to estimate the distinct k-mer count. MetaPhlAn2 [
<xref ref-type="bibr" rid="CR19">19</xref>
] and GOTTCHA [
<xref ref-type="bibr" rid="CR20">20</xref>
] are used to identify the known species from the metagenomic sample. Genomes for those known species are selected from an existing database [
<xref ref-type="bibr" rid="CR23">23</xref>
] to estimate the
<italic>KRI</italic>
for the whole community.</p>
</sec>
<sec id="Sec9">
<title>Simulated metagenomic datasets</title>
<p>Due to the complexity of real-world microbiome compositions, it is hard, if not impossible, to find real metagenomic data for which the complete ground truth of all components is known. To test the performance of our method, we simulated several microbial communities under different scenarios and generated synthetic metagenomic samples. We simulated communities with 10 species and 50 species as representatives of a simple case and a more complicated case. We used three types of abundance distributions to form microbial communities of low, medium and high complexity (LC, MC and HC), following a previous simulation study [
<xref ref-type="bibr" rid="CR24">24</xref>
]. LC, MC and HC are defined by the number of dominant microbes, i.e., those with a high relative abundance. LC has only one dominant species. MC has two or more dominant species. HC has no dominant species. The fraction of information captured by the metagenomic data is of key importance for estimating the total genome length. To reflect this property of a metagenomic sample, we define the initial coverage as the fraction of distinct k-mers in the set of distinct genomes of the target community that are included in the sequencing data. For each community, metagenomic samples with different numbers of reads were generated to simulate different sequencing depths and initial coverages of the community. To check how robust the method is to random effects, we used three random seeds to generate samples for the same parameters. In total, 225 metagenomic samples with 10 species and 243 samples with 50 species were generated with an in-house simulation tool [
<xref ref-type="bibr" rid="CR25">25</xref>
]. Besides the error-free samples, we also generated a set of metagenomic samples with sequencing errors for each community.</p>
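<p>The definition of initial coverage translates directly into code. The sketch below is a simplified illustration and not the in-house simulation tool: the helper names are hypothetical, the genomes are random strings, reads are error-free and drawn uniformly, and species abundances are ignored.</p>
<preformat><![CDATA[
```python
import random


def kmer_set(seqs, k):
    """All distinct k-mers contained in a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def initial_coverage(genomes, reads, k):
    """Fraction of the distinct genome k-mers that also occur in the reads."""
    genome_kmers = kmer_set(genomes, k)
    return len(genome_kmers & kmer_set(reads, k)) / len(genome_kmers)


def sample_reads(genomes, n_reads, read_len, rng):
    """Error-free reads from uniformly chosen genomes and start positions."""
    reads = []
    for _ in range(n_reads):
        g = rng.choice(genomes)
        start = rng.randrange(len(g) - read_len + 1)
        reads.append(g[start:start + read_len])
    return reads


rng = random.Random(1)
# Three hypothetical random genomes of 500 bases each.
genomes = ["".join(rng.choice("ACGT") for _ in range(500)) for _ in range(3)]
shallow = initial_coverage(genomes, sample_reads(genomes, 10, 50, rng), 20)
deep = initial_coverage(genomes, sample_reads(genomes, 300, 50, rng), 20)
print(shallow, deep)
```
]]></preformat>
<p>Increasing the number of reads raises the initial coverage towards 1, which mirrors how different sequencing depths were used to obtain samples with different initial coverages.</p>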
<p>We did some simple simulations to show that
<italic>KRI</italic>
of a genome set can be estimated using part of all genomes. We simulated four metagenomes with 10, 50, 100 and 200 species, respectively. For each metagenome, we randomly select 60% of its component genomes as known ones to estimate the
<italic>KRI</italic>
of the whole metagenome. Although in the real world the known microbes are not a random selection from nature, the order in which they became known has nothing to do with their sequence content. Therefore, we believe such random selection is reasonable.</p>
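<p>The subset experiment can be outlined as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the k-mers of all genomes in a set are pooled before counting, and the two short example sequences stand in for real genomes.</p>
<preformat><![CDATA[
```python
import random


def kri_of_set(genomes, k):
    """KRI of a genome set: pooled total k-mers over pooled distinct k-mers."""
    total = 0
    distinct = set()
    for g in genomes:
        for i in range(len(g) - k + 1):
            total += 1
            distinct.add(g[i:i + k])
    return total / len(distinct)


def kri_from_subset(genomes, k, fraction, rng):
    """Estimate the KRI of the whole set from a random fraction of its genomes."""
    n_known = max(1, round(fraction * len(genomes)))
    return kri_of_set(rng.sample(genomes, n_known), k)


# Hypothetical toy "genomes"; real genomes would be read from FASTA files.
toy = ["ACGTACGT", "ACGTTTTT"]
full_kri = kri_of_set(toy, 4)
subset_kri = kri_from_subset(toy, 4, 0.6, random.Random(0))
print(full_kri, subset_kri)
```
]]></preformat>
<p>With a fraction of 0.6 the function mimics the 60% selection used in the simulation; comparing subset_kri with full_kri over many random draws gives a sense of how stable the estimate is.</p>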
</sec>
<sec id="Sec10">
<title>Real metagenomic datasets</title>
<p>We selected two datasets on which to apply our method. One dataset contains 65 oral metagenomic samples from the Human Microbiome Project (HMP) [
<xref ref-type="bibr" rid="CR26">26</xref>
] and the other consists of 145 human gut metagenomic samples, including 71 from normal people and 74 from type 2 diabetes patients [
<xref ref-type="bibr" rid="CR27">27</xref>
].</p>
</sec>
</sec>
<sec id="Sec11" sec-type="results">
<title>Results</title>
<sec id="Sec12">
<title>Results on simulated metagenomic datasets</title>
<p>We tested our method on all synthetic metagenomic samples. Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
shows how well the number of distinct k-mers (
<italic>DKC</italic>
) in a community can be estimated from a metagenomic sample. The figure contains two parts, showing results for communities with 10 species and 50 species, respectively. Each part consists of three panels, displayed from left to right. Further explanations of each panel are given in the figure caption. As expected, the overall prediction for samples with 10 species is better than for samples with 50 species. Communities with high complexity achieve the best prediction accuracy among the three kinds of abundance distributions. This agrees with the intuition that the more even the abundance distribution is, the better the prediction will be. The performance on communities with medium complexity is the worst. This is because the two dominant species make up more than 70% of the community, which means that most of the reads are sequenced from them. Since less than 30% of the reads come from the remaining species, only a small part of the information about their genomes is reflected in the sequencing data, leading to poor performance, especially when the sequencing depth is low. We also show how the performance changes as the initial coverage increases. The performance is measured by the relative error, defined as the difference between the estimated value and the true value, divided by the true value. In general, the performance improves as the initial coverage increases. Another interesting observation is that, in most cases, the Golub-Welsch algorithm gives a good estimate that tends to be no larger than the ground truth, and the corresponding bootstrap confidence interval is usually small. For the exaggerated estimates, the Golub-Welsch algorithm is more likely to give a large bootstrap confidence interval. Therefore, the Golub-Welsch algorithm provides a reliable estimate of the lower bound of DKC, as suggested in preseq [
<xref ref-type="bibr" rid="CR9">9</xref>
].
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Different microbial communities are simulated to test the performance of the proposed method. (
<bold>a</bold>
) Results for microbial communities with 10 species. The three histograms on the left show the abundance distributions of the different simulated communities. The middle panel shows the estimation results for the distinct k-mer count. Each bar represents an estimation result based on a synthetic metagenomic sample and the error bar shows the 95% bootstrap confidence interval of the estimation. The black dashed line is the true distinct k-mer count. The right panel shows how the relative error changes as the initial coverage increases (
<italic>k</italic>
= 20). (
<bold>b</bold>
) The same as (
<bold>a</bold>
) except that the species number is 50. (Note that some of the samples with 10 species are not shown in the barplot, see Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1 for all samples with 10 species)</p>
</caption>
<graphic xlink:href="12864_2019_5467_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
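<p>The relative error and a bootstrap confidence interval can be sketched as below. The source does not specify the resampling scheme, so the multinomial resampling of the k-mer count histogram used here is an assumption, as are the function names and the toy histogram. The estimator is passed in as a function, so any estimate computed from the histogram can be plugged in.</p>
<preformat><![CDATA[
```python
import numpy as np


def relative_error(estimate, truth):
    """Signed relative error: (estimated value - true value) / true value."""
    return (estimate - truth) / truth


def bootstrap_ci(counts, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(counts).

    counts[j - 1] = n_j, the number of k-mers observed exactly j times.
    Assumption: observed k-mers are resampled with replacement, which is
    a multinomial draw over the histogram bins.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    total = int(counts.sum())
    probs = counts / counts.sum()
    values = [estimator(rng.multinomial(total, probs)) for _ in range(n_boot)]
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(values, [tail, 100.0 - tail])
    return float(lo), float(hi)


# Toy histogram (assumption) and a trivial estimator: the singleton count n_1.
lo, hi = bootstrap_ci([500, 300, 200], lambda c: float(c[0]))
print(lo, hi, relative_error(90.0, 100.0))
```
]]></preformat>
<p>A wide interval signals an unstable estimate, which matches the observation that exaggerated estimates tend to come with large bootstrap confidence intervals.</p>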
</sec>
<sec id="Sec13">
<title>Effects of k and sequencing errors</title>
<p>To see how the parameter k affects the results, we chose different values of k to perform the estimation for a simulated metagenomic sample (50 species, high complexity, 25 million reads). The results show that the estimation is robust to the choice of k (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
c).
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>
<bold>a</bold>
Performance on metagenomic data with sequencing errors.
<bold>b</bold>
True and estimated k-mer redundancy index (KRI) in different metagenomic communities. About 60% of the species are randomly chosen as the known species to estimate the KRI of all species.
<bold>c</bold>
Results for different choices of k. A simulated metagenomic sample with 50 species and a high-complexity abundance distribution was used.
<bold>d</bold>
Results on HMP Tongue Dorsum datasets</p>
</caption>
<graphic xlink:href="12864_2019_5467_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>Despite the good performance on error-free sequencing data, the Golub-Welsch algorithm can give poor predictions when the sequencing data contain errors (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
a). Sequencing errors introduce novel k-mers that should not exist in the data. The algorithm interprets a higher fraction of low-count k-mers as evidence of more low-abundance microbes. Therefore, sequencing errors lead to an exaggerated estimate of the total number of distinct k-mers, and this exaggeration grows as the sequencing depth increases (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
a, green bars). To solve this problem, we use Pollux [
<xref ref-type="bibr" rid="CR21">21</xref>
] to correct the sequencing errors before counting k-mers. Results on simulated data show that the performance is kept under control after correcting the sequencing errors (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
a, blue bars).</p>
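<p>The effect of sequencing errors on the k-mer histogram can be reproduced with a small simulation. The sketch below is illustrative only: the random genome, the read length, the uniform substitution error model and the 1% error rate are assumptions made here, and no error correction is applied.</p>
<preformat><![CDATA[
```python
import random
from collections import Counter


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Uniformly placed reads with independent substitution errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def kmer_histogram(reads, k):
    """Return (observed DKC, n_1), where n_1 counts k-mers seen exactly once."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))  # hypothetical genome
clean = kmer_histogram(simulate_reads(genome, 400, 100, 0.0, rng), 15)
noisy = kmer_histogram(simulate_reads(genome, 400, 100, 0.01, rng), 15)
print(clean, noisy)
```
]]></preformat>
<p>Each substitution can create up to k novel k-mers, most of which are seen only once, so the noisy sample shows far more distinct and singleton k-mers than the genome contains. An estimator that reads singletons as evidence of rare genomes will therefore overestimate unless the errors are corrected first.</p>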
</sec>
<sec id="Sec14">
<title>Comparison between different methods</title>
<p>Besides the Golub-Welsch algorithm, we also applied rational function approximation (RFA), the main algorithm in preseq, to the simulated metagenomic samples with 50 species (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S2) and compared its performance with that of the Golub-Welsch algorithm. Both methods perform well and each has its own strengths (Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
: Figure S3). RFA outperforms the Golub-Welsch algorithm in the medium-complexity communities (two species with a total relative abundance higher than 70%), indicating a stronger ability to extrapolate. For communities with high or low complexity, the Golub-Welsch algorithm gives stable and accurate results with only a few exceptions. RFA also gives good results, but with a slight tendency to overestimate.</p>
</sec>
<sec id="Sec15">
<title>Estimating KRI using known species</title>
<p>There is a gap between the distinct k-mer count (
<italic>DKC</italic>
) and total genome length or
<italic>TKC</italic>
. We use
<italic>KRI</italic>
to bridge this gap, as introduced above. For simulated metagenomic samples, GOTTCHA successfully identified most species and therefore led to a perfect estimation of KRI. We did some simple simulations to show that
<italic>KRI</italic>
of a genome set can be estimated using part of all genomes. In general,
<italic>KRI</italic>
of the community increases as there are more species in the community, as shown in Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
b. The result shows that
<italic>KRI</italic>
of a community can be well estimated using only part of the species, which demonstrates the feasibility of estimating
<italic>KRI</italic>
of a community based only on known species.</p>
</sec>
<sec id="Sec16">
<title>Results on real metagenomic datasets</title>
<p>We applied our method to the two selected datasets (Figs. 
<xref rid="Fig3" ref-type="fig">3</xref>
d and
<xref rid="Fig4" ref-type="fig">4</xref>
). One general observation in the results is that the number of uncaptured k-mers can differ greatly between samples, even when the observed k-mer counts are similar (Figs. 
<xref rid="Fig3" ref-type="fig">3</xref>
d and
<xref rid="Fig4" ref-type="fig">4</xref>
a). Further comparison between normal samples and T2D samples shows that the predicted distinct k-mer counts differ significantly while the observed k-mer counts do not (Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
c and d). In the original study, it was reported that the difference of within-sample diversity (entropy of gene abundance) between normal group and T2D group is not significant [
<xref ref-type="bibr" rid="CR27">27</xref>
]. Since the gene abundances were calculated based only on extracted sequence data, it is possible that the significance was masked by ignoring the difference in the ‘unseen’ information.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Results on T2D metagenomic datasets.
<bold>a</bold>
Observed and estimated k-mer count.
<bold>b</bold>
Histogram and density of the observed distinct k-mer count.
<bold>c</bold>
Histogram and density of the predicted distinct k-mer count</p>
</caption>
<graphic xlink:href="12864_2019_5467_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="Sec17">
<title>Conclusion and discussion</title>
<p>In this paper, we posed the question of ‘how long the total genome length of all different species in a microbial community is’ and introduced a method to answer it. This is an important step toward the estimation of unknown and unseen component genomes in a microbiome. We devised a k-mer-based strategy that removes the reliance on the limited microbial reference genomes so that unknown species can be included in the estimation. To explore the information that has not been directly captured in the metagenomic sample, we developed a statistical method to estimate the number of uncaptured k-mers. The distinct k-mer count was multiplied by the k-mer redundancy index (
<italic>KRI</italic>
), an index defined to reflect the repetition of k-mers and estimated from known species, to obtain the total genome length. Performance on the simulated data shows that the proposed method works well and that the precision of the estimate is mainly affected by the sequencing error, the initial coverage of the community and the complexity of the microbial diversity.</p>
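<p>As an illustrative sketch only, and not the authors' implementation, the final step described above (distinct k-mer count multiplied by KRI) can be written in Python as follows. The function names, the toy sequence and the numeric DKC and KRI values are hypothetical. Strand handling (canonical k-mers) and the statistical estimation of uncaptured k-mers are deliberately left out, so the estimated DKC is simply passed in as a number.</p>
<preformat>
```python
def distinct_kmer_count(sequences, k):
    """Return the number of distinct k-mers (DKC) observed in the given DNA strings."""
    seen = set()
    for seq in sequences:
        # Slide a window of width k; sequences shorter than k contribute nothing.
        for start in range(len(seq) - k + 1):
            seen.add(seq[start:start + k])
    return len(seen)


def total_genome_length(estimated_dkc, kri):
    """Multiply an estimated DKC (observed plus uncaptured) by the KRI."""
    return estimated_dkc * kri


if __name__ == "__main__":
    # Toy input: one short read, k = 4.
    observed = distinct_kmer_count(["ACGTACGT"], k=4)
    print("observed DKC:", observed)
    # Hypothetical inputs: an estimated DKC of 1,000,000 and a KRI of 1.25.
    print("total genome length:", total_genome_length(1_000_000, 1.25))
```
</preformat>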
<p>Extracting information from metagenomic data is the foundation of downstream analysis. The complex nature of microbial communities and the inadequate microbial diversity represented in existing databases make it challenging to extract the full information. Because of this complexity, a metagenomic sample captures only part of the information about the microbial community, and only part of that can be extracted because of the limited known references. Ignoring this ‘uncaptured’ and ‘unknown’ information can mislead downstream analyses. In estimating total genome length, we adopted a reference-free strategy to include the ‘unknown’ information and employed a statistical model to estimate the ‘uncaptured’ part, so that the extracted information is as complete as possible. The experiments on simulated data showed the feasibility of the proposed method, and the results on real datasets revealed that downstream analyses may be biased if ‘unseen’ information is ignored. Further studies are needed to explore how the estimated total metagenome length can help to better extract information about unknown or uncaptured species from metagenomic data and to compare metagenome samples.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional file</title>
<sec id="Sec18">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12864_2019_5467_MOESM1_ESM.pdf">
<label>Additional file 1</label>
<caption>
<p>This file contains
<bold>Figure S1</bold>
to
<bold>Figure S3</bold>
. (PDF 6194 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>DKC</term>
<def>
<p>Distinct k-mer count</p>
</def>
</def-item>
<def-item>
<term>KRI</term>
<def>
<p>K-mer redundancy index</p>
</def>
</def-item>
<def-item>
<term>TKC</term>
<def>
<p>Total k-mer count</p>
</def>
</def-item>
</def-list>
</glossary>
<ack>
<title>Acknowledgements</title>
<p>Not applicable.</p>
<sec id="d29e1587">
<title>Funding</title>
<p>The publication of this work was sponsored by the National Natural Science Foundation of China (61673231 and 61721003).</p>
</sec>
<sec id="d29e1592" sec-type="data-availability">
<title>Availability of data and materials</title>
<p>K-mer count tables for all simulated datasets and real datasets can be found at
<ext-link ext-link-type="uri" xlink:href="https://github.com/stevenhuakui/Total-genome-length-data">https://github.com/stevenhuakui/Total-genome-length-data</ext-link>
.</p>
</sec>
<sec id="d29e1602">
<title>About this supplement</title>
<p>This article has been published as part of
<italic>BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics</italic>
. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-2">https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-2</ext-link>
.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>KH conceived the study, developed methodology, performed data analysis and wrote the manuscript. XZ conceived the study and wrote the manuscript. Both authors have read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec>
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gordon</surname>
<given-names>JI</given-names>
</name>
</person-group>
<article-title>Honor thy gut symbionts redux</article-title>
<source>Science</source>
<year>2012</year>
<volume>336</volume>
<issue>6086</issue>
<fpage>1251</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="doi">10.1126/science.1224686</pub-id>
<pub-id pub-id-type="pmid">22674326</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Falony</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wijmenga</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Population-level analysis of gut microbiome variation</article-title>
<source>Science</source>
<year>2016</year>
<volume>352</volume>
<issue>6285</issue>
<fpage>560</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1126/science.aad3503</pub-id>
<pub-id pub-id-type="pmid">27126039</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhernakova</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wijmenga</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity</article-title>
<source>Science</source>
<year>2016</year>
<volume>352</volume>
<issue>6285</issue>
<fpage>565</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1126/science.aad3369</pub-id>
<pub-id pub-id-type="pmid">27126040</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>An overview of major metagenomic studies on human microbiomes in health and disease</article-title>
<source>Quant Biol</source>
<year>2016</year>
<volume>4</volume>
<issue>3</issue>
<fpage>192</fpage>
<lpage>206</lpage>
<pub-id pub-id-type="doi">10.1007/s40484-016-0078-x</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Reading the underlying information from massive metagenomic sequencing data</article-title>
<source>Proc IEEE</source>
<year>2017</year>
<volume>105</volume>
<issue>3</issue>
<fpage>459</fpage>
<lpage>73</lpage>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodriguez</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Konstantinidis</surname>
<given-names>KT</given-names>
</name>
</person-group>
<article-title>Estimating coverage in metagenomic data sets and why it matters</article-title>
<source>ISME J</source>
<year>2014</year>
<volume>8</volume>
<issue>11</issue>
<fpage>2349</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2014.76</pub-id>
<pub-id pub-id-type="pmid">24824669</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>Genomic mapping by fingerprinting random clones: a mathematical analysis</article-title>
<source>Genomics</source>
<year>1988</year>
<volume>2</volume>
<issue>3</issue>
<fpage>231</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1016/0888-7543(88)90007-9</pub-id>
<pub-id pub-id-type="pmid">3294162</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hooper</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Dalevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pati</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
</person-group>
<article-title>Estimating DNA coverage and abundance in metagenomes using a gamma approximation</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>3</issue>
<fpage>295</fpage>
<lpage>301</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp687</pub-id>
<pub-id pub-id-type="pmid">20008478</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Daley</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>AD</given-names>
</name>
</person-group>
<article-title>Predicting the molecular complexity of sequencing libraries</article-title>
<source>Nat Methods</source>
<year>2013</year>
<volume>10</volume>
<issue>4</issue>
<fpage>325</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2375</pub-id>
<pub-id pub-id-type="pmid">23435259</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodriguez</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Konstantinidis</surname>
<given-names>KT</given-names>
</name>
</person-group>
<article-title>Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>5</issue>
<fpage>629</fpage>
<lpage>35</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt584</pub-id>
<pub-id pub-id-type="pmid">24123672</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tamames</surname>
<given-names>J</given-names>
</name>
<name>
<surname>de la Pena</surname>
<given-names>S</given-names>
</name>
<name>
<surname>de Lorenzo</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>COVER: a priori estimation of coverage for metagenomic sequencing</article-title>
<source>Environ Microbiol Rep</source>
<year>2012</year>
<volume>4</volume>
<issue>3</issue>
<fpage>335</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="doi">10.1111/j.1758-2229.2012.00338.x</pub-id>
<pub-id pub-id-type="pmid">23760797</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wendl</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Kota</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Weinstock</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Mitreva</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens’ theorem</article-title>
<source>J Math Biol</source>
<year>2013</year>
<volume>67</volume>
<issue>5</issue>
<fpage>1141</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="doi">10.1007/s00285-012-0586-x</pub-id>
<pub-id pub-id-type="pmid">22965653</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Waldron</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ballarini</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jousson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic microbial community profiling using unique clade-specific marker genes</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>8</issue>
<fpage>811</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2066</pub-id>
<pub-id pub-id-type="pmid">22688413</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oh</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Byrd</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Deming</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Conlan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Program</surname>
<given-names>NCS</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>HH</given-names>
</name>
<name>
<surname>Segre</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>Biogeography and individuality shape function in the human skin metagenome</article-title>
<source>Nature</source>
<year>2014</year>
<volume>514</volume>
<issue>7520</issue>
<fpage>59</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="doi">10.1038/nature13786</pub-id>
<pub-id pub-id-type="pmid">25279917</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bankevich</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>Joint analysis of long and short reads enables accurate estimates of microbiome complexity</article-title>
<source>Cell Syst</source>
<year>2018</year>
<volume>7</volume>
<issue>2</issue>
<fpage>192</fpage>
<lpage>200</lpage>
<pub-id pub-id-type="doi">10.1016/j.cels.2018.06.009</pub-id>
<pub-id pub-id-type="pmid">30056005</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barbour</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>LHY</given-names>
</name>
<name>
<surname>Loh</surname>
<given-names>WL</given-names>
</name>
</person-group>
<article-title>Compound Poisson approximation for nonnegative random-variables via Stein method</article-title>
<source>Ann Probab</source>
<year>1992</year>
<volume>20</volume>
<issue>4</issue>
<fpage>1843</fpage>
<lpage>66</lpage>
<pub-id pub-id-type="doi">10.1214/aop/1176989531</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Daley</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>AD</given-names>
</name>
</person-group>
<article-title>Modeling genome coverage in single-cell sequencing</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>22</issue>
<fpage>3159</fpage>
<lpage>65</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu540</pub-id>
<pub-id pub-id-type="pmid">25107873</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Golub</surname>
<given-names>GH</given-names>
</name>
<name>
<surname>Welsch</surname>
<given-names>JH</given-names>
</name>
</person-group>
<article-title>Calculation of Gauss quadrature rules</article-title>
<source>Math Comput</source>
<year>1969</year>
<volume>23</volume>
<issue>106</issue>
<fpage>221</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="doi">10.1090/S0025-5718-69-99647-1</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Truong</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Franzosa</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Tickle</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Weingart</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pasolli</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tett</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>MetaPhlAn2 for enhanced metagenomic taxonomic profiling</article-title>
<source>Nat Methods</source>
<year>2015</year>
<volume>12</volume>
<issue>10</issue>
<fpage>902</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3589</pub-id>
<pub-id pub-id-type="pmid">26418763</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Freitas</surname>
<given-names>TAK</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P-E</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Chain</surname>
<given-names>PS</given-names>
</name>
</person-group>
<article-title>Accurate read-based metagenome characterization using a hierarchical suite of unique signatures</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<issue>10</issue>
<fpage>69</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkv180</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marinier</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>McConkey</surname>
<given-names>BJ</given-names>
</name>
</person-group>
<article-title>Pollux: platform independent error correction of single and mixed genomes</article-title>
<source>BMC Bioinformatics</source>
<year>2015</year>
<volume>16</volume>
<fpage>10</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-014-0435-6</pub-id>
<pub-id pub-id-type="pmid">25592313</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruitt</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Tatusova</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Maglott</surname>
<given-names>DR</given-names>
</name>
</person-group>
<article-title>NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>61</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl842</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Use of simulated data sets to evaluate the fidelity of metagenomic processing methods</article-title>
<source>Nat Methods</source>
<year>2007</year>
<volume>4</volume>
<issue>6</issue>
<fpage>495</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth1043</pub-id>
<pub-id pub-id-type="pmid">17468765</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hua</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution</article-title>
<source>Quant Biol</source>
<year>2018</year>
<volume>6</volume>
<issue>2</issue>
<fpage>175</fpage>
<lpage>85</lpage>
<pub-id pub-id-type="doi">10.1007/s40484-018-0142-9</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Turnbaugh</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Ley</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Hamady</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fraser-Liggett</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>JI</given-names>
</name>
</person-group>
<article-title>The human microbiome project</article-title>
<source>Nature</source>
<year>2007</year>
<volume>449</volume>
<issue>7164</issue>
<fpage>804</fpage>
<lpage>10</lpage>
<pub-id pub-id-type="doi">10.1038/nature06244</pub-id>
<pub-id pub-id-type="pmid">17943116</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kristiansen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A metagenome-wide association study of gut microbiota in type 2 diabetes</article-title>
<source>Nature</source>
<year>2012</year>
<volume>490</volume>
<issue>7418</issue>
<fpage>55</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nature11450</pub-id>
<pub-id pub-id-type="pmid">23023125</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0003020 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0003020 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021