Statistical Approach of Functional Profiling for a Microbial Community

Internal identifier: 000086 (Pmc/Corpus); previous: 000085; next: 000087

Authors: Lingling An; Nauromal Pookhao; Hongmei Jiang; Jiannong Xu

Source:

PLoS ONE [1932-6203]; 2014.

RBID : PMC:4157783

Abstract

Background

Metagenomics is a relatively new but fast-growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. High-throughput next-generation sequencing technologies are now driving metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze massive metagenomic short-read sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues concerning the functions of metagenomes at the pathway or subsystem level have been investigated, few studies focus on functional analysis at a low level of a hierarchical functional tree, such as the SEED subsystem tree.

Results

A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step, a statistical mixture model at the level of gene codons is proposed to estimate the abundances of the candidate functional roles while accounting for sequencing error. Because a gene could be involved in multiple biological processes, the functional assignment is adjusted in the second step using an error distribution. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with existing methods for metagenomic functional analysis, the new approach is more accurate in assigning reads to functional roles, and therefore also at more general levels. The method is also applied to two real data sets.

Conclusions

metaFunction is a powerful tool for accurately profiling functions in a metagenomic sample.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4157783

DOI: 10.1371/journal.pone.0106588
PubMed: 25198674
PubMed Central: 4157783

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Statistical Approach of Functional Profiling for a Microbial Community</title>
<author><name sortKey="An, Lingling" sort="An, Lingling" uniqKey="An L" first="Lingling" last="An">Lingling An</name>
<affiliation><nlm:aff id="aff1"><addr-line>Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff2"><addr-line>Interdisciplinary Programs in Statistics, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pookhao, Nauromal" sort="Pookhao, Nauromal" uniqKey="Pookhao N" first="Nauromal" last="Pookhao">Nauromal Pookhao</name>
<affiliation><nlm:aff id="aff1"><addr-line>Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jiang, Hongmei" sort="Jiang, Hongmei" uniqKey="Jiang H" first="Hongmei" last="Jiang">Hongmei Jiang</name>
<affiliation><nlm:aff id="aff3"><addr-line>Department of Statistics, Northwestern University, Evanston, Illinois, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Xu, Jiannong" sort="Xu, Jiannong" uniqKey="Xu J" first="Jiannong" last="Xu">Jiannong Xu</name>
<affiliation><nlm:aff id="aff4"><addr-line>Department of Biology, New Mexico State University, Las Cruces, New Mexico, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">25198674</idno>
<idno type="pmc">4157783</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4157783</idno>
<idno type="RBID">PMC:4157783</idno>
<idno type="doi">10.1371/journal.pone.0106588</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000086</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Statistical Approach of Functional Profiling for a Microbial Community</title>
<author><name sortKey="An, Lingling" sort="An, Lingling" uniqKey="An L" first="Lingling" last="An">Lingling An</name>
<affiliation><nlm:aff id="aff1"><addr-line>Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff2"><addr-line>Interdisciplinary Programs in Statistics, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pookhao, Nauromal" sort="Pookhao, Nauromal" uniqKey="Pookhao N" first="Nauromal" last="Pookhao">Nauromal Pookhao</name>
<affiliation><nlm:aff id="aff1"><addr-line>Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jiang, Hongmei" sort="Jiang, Hongmei" uniqKey="Jiang H" first="Hongmei" last="Jiang">Hongmei Jiang</name>
<affiliation><nlm:aff id="aff3"><addr-line>Department of Statistics, Northwestern University, Evanston, Illinois, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Xu, Jiannong" sort="Xu, Jiannong" uniqKey="Xu J" first="Jiannong" last="Xu">Jiannong Xu</name>
<affiliation><nlm:aff id="aff4"><addr-line>Department of Biology, New Mexico State University, Las Cruces, New Mexico, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Metagenomics is a relatively new but fast-growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. High-throughput next-generation sequencing technologies are now driving metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze massive metagenomic short-read sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues concerning the functions of metagenomes at the pathway or subsystem level have been investigated, few studies focus on functional analysis at a low level of a hierarchical functional tree, such as the SEED subsystem tree.</p>
</sec>
<sec><title>Results</title>
<p>A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step, a statistical mixture model at the level of gene codons is proposed to estimate the abundances of the candidate functional roles while accounting for sequencing error. Because a gene could be involved in multiple biological processes, the functional assignment is adjusted in the second step using an error distribution. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with existing methods for metagenomic functional analysis, the new approach is more accurate in assigning reads to functional roles, and therefore also at more general levels. The method is also applied to two real data sets.</p>
</sec>
<sec><title>Conclusions</title>
<p>metaFunction is a powerful tool for accurately profiling functions in a metagenomic sample.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author><name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author><name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author><name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rosen, Gl" uniqKey="Rosen G">GL Rosen</name>
</author>
<author><name sortKey="Sokhansanj, Ba" uniqKey="Sokhansanj B">BA Sokhansanj</name>
</author>
<author><name sortKey="Polikar, R" uniqKey="Polikar R">R Polikar</name>
</author>
<author><name sortKey="Bruns, Ma" uniqKey="Bruns M">MA Bruns</name>
</author>
<author><name sortKey="Russell, J" uniqKey="Russell J">J Russell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mardis, Er" uniqKey="Mardis E">ER Mardis</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Meinicke, P" uniqKey="Meinicke P">P Meinicke</name>
</author>
<author><name sortKey="Asshauer, Kp" uniqKey="Asshauer K">KP Asshauer</name>
</author>
<author><name sortKey="Lingner, T" uniqKey="Lingner T">T Lingner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author><name sortKey="Mitra, S" uniqKey="Mitra S">S Mitra</name>
</author>
<author><name sortKey="Ruscheweyh, Hj" uniqKey="Ruscheweyh H">HJ Ruscheweyh</name>
</author>
<author><name sortKey="Weber, N" uniqKey="Weber N">N Weber</name>
</author>
<author><name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Xia, Lc" uniqKey="Xia L">LC Xia</name>
</author>
<author><name sortKey="Cram, Ja" uniqKey="Cram J">JA Cram</name>
</author>
<author><name sortKey="Chen, T" uniqKey="Chen T">T Chen</name>
</author>
<author><name sortKey="Fuhrman, Ja" uniqKey="Fuhrman J">JA Fuhrman</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jiang, H" uniqKey="Jiang H">H Jiang</name>
</author>
<author><name sortKey="An, L" uniqKey="An L">L An</name>
</author>
<author><name sortKey="Lin, Sm" uniqKey="Lin S">SM Lin</name>
</author>
<author><name sortKey="Feng, G" uniqKey="Feng G">G Feng</name>
</author>
<author><name sortKey="Qiu, Y" uniqKey="Qiu Y">Y Qiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lindner, Ms" uniqKey="Lindner M">MS Lindner</name>
</author>
<author><name sortKey="Renard, By" uniqKey="Renard B">BY Renard</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Overbeek, R" uniqKey="Overbeek R">R Overbeek</name>
</author>
<author><name sortKey="Begley, T" uniqKey="Begley T">T Begley</name>
</author>
<author><name sortKey="Butler, Rm" uniqKey="Butler R">RM Butler</name>
</author>
<author><name sortKey="Choudhuri, Jv" uniqKey="Choudhuri J">JV Choudhuri</name>
</author>
<author><name sortKey="Chuang, Hy" uniqKey="Chuang H">HY Chuang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dinsdale, Ea" uniqKey="Dinsdale E">EA Dinsdale</name>
</author>
<author><name sortKey="Edwards, Ra" uniqKey="Edwards R">RA Edwards</name>
</author>
<author><name sortKey="Hall, D" uniqKey="Hall D">D Hall</name>
</author>
<author><name sortKey="Angly, F" uniqKey="Angly F">F Angly</name>
</author>
<author><name sortKey="Breitbart, M" uniqKey="Breitbart M">M Breitbart</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Parks, Dh" uniqKey="Parks D">DH Parks</name>
</author>
<author><name sortKey="Beiko, Rg" uniqKey="Beiko R">RG Beiko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sharon, I" uniqKey="Sharon I">I Sharon</name>
</author>
<author><name sortKey="Bercovici, S" uniqKey="Bercovici S">S Bercovici</name>
</author>
<author><name sortKey="Pinter, Ry" uniqKey="Pinter R">RY Pinter</name>
</author>
<author><name sortKey="Shlomi, T" uniqKey="Shlomi T">T Shlomi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Prakash, T" uniqKey="Prakash T">T Prakash</name>
</author>
<author><name sortKey="Taylor, Td" uniqKey="Taylor T">TD Taylor</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Sun, Sl" uniqKey="Sun S">SL Sun</name>
</author>
<author><name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author><name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
<author><name sortKey="Altintas, I" uniqKey="Altintas I">I Altintas</name>
</author>
<author><name sortKey="Lin, A" uniqKey="Lin A">A Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hoff, Kj" uniqKey="Hoff K">KJ Hoff</name>
</author>
<author><name sortKey="Lingner, T" uniqKey="Lingner T">T Lingner</name>
</author>
<author><name sortKey="Meinicke, P" uniqKey="Meinicke P">P Meinicke</name>
</author>
<author><name sortKey="Tech, M" uniqKey="Tech M">M Tech</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Apea, Dempster" uniqKey="Apea D">Dempster APea</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Engeman, Rm" uniqKey="Engeman R">RM Engeman</name>
</author>
<author><name sortKey="Sugihara, Rt" uniqKey="Sugihara R">RT Sugihara</name>
</author>
<author><name sortKey="Pank, Lf" uniqKey="Pank L">LF Pank</name>
</author>
<author><name sortKey="Dusenberry, We" uniqKey="Dusenberry W">WE Dusenberry</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Steffen, Mm" uniqKey="Steffen M">MM Steffen</name>
</author>
<author><name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author><name sortKey="Effler, Tc" uniqKey="Effler T">TC Effler</name>
</author>
<author><name sortKey="Hauser, Lj" uniqKey="Hauser L">LJ Hauser</name>
</author>
<author><name sortKey="Boyer, Gl" uniqKey="Boyer G">GL Boyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Ready, D" uniqKey="Ready D">D Ready</name>
</author>
<author><name sortKey="Pratten, J" uniqKey="Pratten J">J Pratten</name>
</author>
<author><name sortKey="Roberts, Ap" uniqKey="Roberts A">AP Roberts</name>
</author>
<author><name sortKey="Bedi, R" uniqKey="Bedi R">R Bedi</name>
</author>
<author><name sortKey="Mullany, P" uniqKey="Mullany P">P Mullany</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fozo, Em" uniqKey="Fozo E">EM Fozo</name>
</author>
<author><name sortKey="Scott Anne, K" uniqKey="Scott Anne K">K Scott-Anne</name>
</author>
<author><name sortKey="Koo, H" uniqKey="Koo H">H Koo</name>
</author>
<author><name sortKey="Quivey, Rg" uniqKey="Quivey R">RG Quivey</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Seshadri, G" uniqKey="Seshadri G">G Seshadri</name>
</author>
<author><name sortKey="Myers, Gsa" uniqKey="Myers G">GSA Myers</name>
</author>
<author><name sortKey="Tettelin, H" uniqKey="Tettelin H">H Tettelin</name>
</author>
<author><name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
<author><name sortKey="Heidelberg, Jf" uniqKey="Heidelberg J">JF Heidelberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Park, Sn" uniqKey="Park S">SN Park</name>
</author>
<author><name sortKey="Kong, Sw" uniqKey="Kong S">SW Kong</name>
</author>
<author><name sortKey="Kim, Hs" uniqKey="Kim H">HS Kim</name>
</author>
<author><name sortKey="Park, Ms" uniqKey="Park M">MS Park</name>
</author>
<author><name sortKey="Lee, Jw" uniqKey="Lee J">JW Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yoshimura, M" uniqKey="Yoshimura M">M Yoshimura</name>
</author>
<author><name sortKey="Nakano, Y" uniqKey="Nakano Y">Y Nakano</name>
</author>
<author><name sortKey="Yamashita, Y" uniqKey="Yamashita Y">Y Yamashita</name>
</author>
<author><name sortKey="Oho, T" uniqKey="Oho T">T Oho</name>
</author>
<author><name sortKey="Saito, T" uniqKey="Saito T">T Saito</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group><journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">25198674</article-id>
<article-id pub-id-type="pmc">4157783</article-id>
<article-id pub-id-type="publisher-id">PONE-D-14-15786</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0106588</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Computational Biology</subject>
<subj-group><subject>Genomics Statistics</subject>
</subj-group>
</subj-group>
<subj-group><subject>Genetics</subject>
<subj-group><subject>Genomics</subject>
<subj-group><subject>Metagenomics</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group><subject>Molecular Biology</subject>
<subj-group><subject>Molecular Biology Techniques</subject>
<subj-group><subject>Sequencing Techniques</subject>
<subj-group><subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2"><subject>Research and Analysis Methods</subject>
<subj-group><subject>Database and Informatics Methods</subject>
<subj-group><subject>Bioinformatics</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>Statistical Approach of Functional Profiling for a Microbial Community</article-title>
<alt-title alt-title-type="running-head">Functional Metagenomics</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author" equal-contrib="yes"><name><surname>An</surname>
<given-names>Lingling</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor1"><sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes"><name><surname>Pookhao</surname>
<given-names>Nauromal</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Jiang</surname>
<given-names>Hongmei</given-names>
</name>
<xref ref-type="aff" rid="aff3"><sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Xu</surname>
<given-names>Jiannong</given-names>
</name>
<xref ref-type="aff" rid="aff4"><sup>4</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1"><label>1</label>
<addr-line>Department of Agricultural & Biosystems Engineering, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</aff>
<aff id="aff2"><label>2</label>
<addr-line>Interdisciplinary Programs in Statistics, University of Arizona, Tucson, Arizona, United States of America</addr-line>
</aff>
<aff id="aff3"><label>3</label>
<addr-line>Department of Statistics, Northwestern University, Evanston, Illinois, United States of America</addr-line>
</aff>
<aff id="aff4"><label>4</label>
<addr-line>Department of Biology, New Mexico State University, Las Cruces, New Mexico, United States of America</addr-line>
</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Tang</surname>
<given-names>Haixu</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1"><addr-line>Indiana University, United States of America</addr-line>
</aff>
<author-notes><corresp id="cor1">* E-mail: <email>anling@email.arizona.edu</email>
</corresp>
<fn fn-type="conflict"><p><bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con"><p>Conceived and designed the experiments: LA. Performed the experiments: NP LA. Analyzed the data: LA NP. Contributed reagents/materials/analysis tools: HJ JX. Contributed to the writing of the manuscript: LA NP.</p>
</fn>
</author-notes>
<pub-date pub-type="collection"><year>2014</year>
</pub-date>
<pub-date pub-type="epub"><day>8</day>
<month>9</month>
<year>2014</year>
</pub-date>
<volume>9</volume>
<issue>9</issue>
<elocation-id>e106588</elocation-id>
<history><date date-type="received"><day>8</day>
<month>4</month>
<year>2014</year>
</date>
<date date-type="accepted"><day>31</day>
<month>7</month>
<year>2014</year>
</date>
</history>
<permissions><copyright-year>2014</copyright-year>
<copyright-holder>An et al</copyright-holder>
<license><license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<abstract><sec><title>Background</title>
<p>Metagenomics is a relatively new but fast-growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. High-throughput next-generation sequencing technologies are now driving metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze massive metagenomic short-read sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues concerning the functions of metagenomes at the pathway or subsystem level have been investigated, few studies focus on functional analysis at a low level of a hierarchical functional tree, such as the SEED subsystem tree.</p>
</sec>
<sec><title>Results</title>
<p>A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step, a statistical mixture model at the level of gene codons is proposed to estimate the abundances of the candidate functional roles while accounting for sequencing error. Because a gene could be involved in multiple biological processes, the functional assignment is adjusted in the second step using an error distribution. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with existing methods for metagenomic functional analysis, the new approach is more accurate in assigning reads to functional roles, and therefore also at more general levels. The method is also applied to two real data sets.</p>
</sec>
<sec><title>Conclusions</title>
<p>metaFunction is a powerful tool for accurately profiling functions in a metagenomic sample.</p>
</sec>
</abstract>
<funding-group><funding-statement>This work was supported by National Science Foundation [DMS-1043080 to HJ and LA] and [DMS-1222592 to LA, HJ, JX], and partially supported by National Institutes of Health [P30 ES006694 to LA] and by The Cecil Miller Endowment at University of Arizona Foundation to NP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts><page-count count="11"></page-count>
</counts>
<custom-meta-group><custom-meta id="data-availability"><meta-name>Data Availability</meta-name>
<meta-value>The authors confirm that all data underlying the findings are fully available without restriction. The simulated data can be found at the website of the proposed software - metaFunction. The real data used in this manuscript are downloaded from the corresponding public database/website, which is addressed in the manuscript.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes><title>Data Availability</title>
<p>The authors confirm that all data underlying the findings are fully available without restriction. The simulated data can be found at the website of the proposed software - metaFunction. The real data used in this manuscript are downloaded from the corresponding public database/website, which is addressed in the manuscript.</p>
</notes>
</front>
<body><sec id="s1"><title>Introduction</title>
<p>Metagenomics is the study of genetic material recovered directly from natural (e.g., soil or seawater) or host-associated (e.g., human gut) environmental samples that contain microorganisms organized into communities. The advancement of high-throughput next-generation sequencing technologies provides a powerful approach to metagenomic studies, since these technologies can be applied directly to an environmental sample without the need to isolate and culture individual microbial species in a laboratory. More than 99% of the millions of microbial species on Earth cannot be cultured in a laboratory [1,2]. Massively parallel sequencing technologies, such as 454FLX, Illumina Genome Analyzer (GA), and ABI SOLiD, have made it possible to generate millions of reads (35-500 base pairs (bp), depending on the platform) at a time [3].</p>
<p>The initial computational analysis of metagenomics focuses on two main questions: who is out there and what can they do [1,2]. To answer the first question, scientists determine the taxonomic composition of a particular metagenomic sample and the abundance/proportions of the species. Many methods have been proposed [4–7]; in particular, TAMER [8], GASiC [9], and TAEC [10] focus on taxonomic analysis at a very low phylogenetic level, the species.</p>
<p>To answer the question “what can they do”, scientists need to determine the gene content and functional categories, and estimate the relative functional abundances in the metagenomic sample. According to Overbeek et al. [11], a functional role corresponds roughly to a single logical role that a gene or gene product may play in the operation of a cell, such as ‘Aspartokinase (EC 2.7.2.4)’, and a pathway or subsystem is a collection of related functional roles (<xref ref-type="fig" rid="pone-0106588-g001">Figure 1</xref>
). To characterize the functional capacity of a metagenomic community, therefore, researchers can perform analysis at either the functional role level or the pathway/subsystem level. Most recently published studies have focused on the pathway or subsystem level [12–15]. However, a number of questions about the functional roles of microbial communities remain open, e.g., do microbial communities harbor extensive genetic diversity, how diverse are they in functional roles, and how does the diversity in functional roles affect their interaction with the environment? Performing functional analysis of metagenomes at the functional role level is therefore an appropriate approach to addressing these issues. Through this type of analysis, functional roles can be detected, and the metabolic pathways or subsystems in which those functional roles are involved can then be established [14].</p>
<fig id="pone-0106588-g001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g001</object-id>
<label>Figure 1</label>
<caption><title>Illustration of subsystem tree structure in SEED.</title>
</caption>
<graphic xlink:href="pone.0106588.g001"></graphic>
</fig>
<p>Several tools have been developed to detect/annotate functional roles from a metagenomic sample [16]. Most of the commonly used publicly available pipelines are homology-based tools, such as MEGAN [17], MG-RAST [18], IMG/M [19], and CAMERA [20]. In MEGAN the functional analysis of metagenomes is based on the SEED hierarchy [18]. The SEED has the most consistent and accurate microbial genome annotations of any publicly available source [11]. To perform a functional analysis, MEGAN assigns each read to the functional role of the highest-scoring gene in a BLAST comparison against a protein database (e.g., NCBI-NR), and the different functional roles are then grouped into SEED subsystems. The SEED classification can be represented by a hierarchical tree, where the internal nodes represent subsystems and the leaves denote the functional roles (<xref ref-type="fig" rid="pone-0106588-g001">Figure 1</xref>
).</p>
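<p>The grouping of functional roles into subsystems described above can be illustrated with a small sketch. It assumes a hypothetical mapping from each functional role (leaf) to its path of ancestor nodes; the role names are taken from the text, but the category and subsystem labels and the read counts are placeholders rather than actual SEED entries or data.</p>

```python
from collections import Counter

# Hypothetical leaf-to-root paths. The role names come from the text; the
# category and subsystem labels are placeholders, not real SEED entries.
ROLE_PATHS = {
    "Aspartokinase (EC 2.7.2.4)": ("category A", "subsystem A1"),
    "Argininosuccinate lyase (EC 4.3.2.1)": ("category A", "subsystem A2"),
    "N-acetylglutamate synthase (EC 2.3.1.1)": ("category A", "subsystem A2"),
}


def roll_up(role_counts, role_paths):
    """Sum read counts from functional roles (leaves) into every ancestor node."""
    totals = Counter()
    for role, count in role_counts.items():
        # depth 1 is the most general level; deeper levels are closer to the leaf
        for depth, node in enumerate(role_paths[role], start=1):
            totals[(depth, node)] += count
    return totals


# Made-up read counts per functional role.
role_counts = {
    "Aspartokinase (EC 2.7.2.4)": 5,
    "Argininosuccinate lyase (EC 4.3.2.1)": 3,
    "N-acetylglutamate synthase (EC 2.3.1.1)": 2,
}
subsystem_totals = roll_up(role_counts, ROLE_PATHS)
```

<p>Because counts at the internal nodes are sums over the leaves, any error in assigning reads to functional roles carries over to the more general levels.</p>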
<p>However, the MEGAN program has several disadvantages. First, the best-score assignment might miss putative functions. Because of sequencing error [21], a sequence read could come from a gene/function with aligned matches of 32 out of 33 codons, but could also come from a gene/function with aligned matches of 31 out of 33 codons. The MEGAN method misses the second or even the third best-scoring functions that the read may have. Furthermore, a gene could perform multiple functions at the same time. However, MEGAN assigns just one function (with the best match value) to the short read even when multiple functions show the same best match values (e.g., the e-value, bit score, or the number of matched codons). For example, the BLASTX output for a short read shows two functions “<italic>Argininosuccinate lyase (EC 4.3.2.1)</italic>
” and “<italic>N-acetylglutamate synthase (EC 2.3.1.1)</italic>
” with the same best match values, but MEGAN assigns only the first function (alphabetically) to the read. Thus, MEGAN misses some functions existing in the community and therefore underestimates their abundance.</p>
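<p>The tie problem can be made concrete with a minimal sketch. It is not MEGAN's code: the function names are hypothetical, the bit scores are invented, and the alphabetical tie-break simply mirrors the behavior described above.</p>

```python
def best_hit_alphabetical(hits):
    """Keep a single function per read: the top score, ties broken alphabetically."""
    top = max(score for _, score in hits)
    return min(name for name, score in hits if score == top)


def all_top_hits(hits):
    """Keep every function that shares the top score."""
    top = max(score for _, score in hits)
    return sorted(name for name, score in hits if score == top)


# Hypothetical (function, bit score) pairs for one short read.
hits = [
    ("N-acetylglutamate synthase (EC 2.3.1.1)", 62.0),
    ("Argininosuccinate lyase (EC 4.3.2.1)", 62.0),
    ("Aspartokinase (EC 2.7.2.4)", 48.5),
]

single = best_hit_alphabetical(hits)  # one function survives the tie
tied = all_top_hits(hits)             # both tied functions are retained
```

<p>With the single-assignment rule, the second tied function receives no count from this read, which is how its abundance ends up underestimated.</p>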
<p>MG-RAST [18] can assign multiple functions to a read, but it uses flat cutoffs, e.g., e-value < 1.0e-5 and identity > 60%. The assignment of reads to different ranks of the taxonomy tree therefore depends greatly on the bit-score or Expect-value threshold used. As a consequence, the results lack specificity. IMG/M uses the best BLAST hits for function assignment [19]. In CAMERA [20], open reading frames (ORFs) are clustered at a certain identity cutoff (e.g., 60%) over a certain threshold (e.g., 80%) of ORF length. The ORF clusters are then used for functional studies. Both the best-hit approach (in MEGAN and IMG/M) and the objective cutoff approach (in MG-RAST and CAMERA) lack statistical support.</p>
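<p>The sensitivity of a flat-cutoff rule to its thresholds can be seen in a short sketch. The default thresholds are the example values quoted above (e-value below 1.0e-5, identity above 60%); the hit records, field names, and role labels are hypothetical and do not reflect MG-RAST's actual data structures.</p>

```python
def passes_flat_cutoffs(hit, max_evalue=1.0e-5, min_identity=60.0):
    """Return True when a hit clears both fixed thresholds."""
    return max_evalue > hit["evalue"] and hit["identity"] > min_identity


# Hypothetical alignment hits for one read.
hits = [
    {"function": "role X", "evalue": 1.0e-8, "identity": 85.0},
    {"function": "role Y", "evalue": 2.0e-6, "identity": 60.5},
    {"function": "role Z", "evalue": 2.0e-6, "identity": 59.5},
]

kept = [h["function"] for h in hits if passes_flat_cutoffs(h)]
# Raising the identity cutoff by one point changes which roles are reported.
kept_strict = [h["function"] for h in hits if passes_flat_cutoffs(h, min_identity=61.0)]
```

<p>Two nearly identical hits (roles Y and Z) fall on opposite sides of the default cutoff, and a one-point change in the threshold drops role Y as well.</p>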
<p>Motivated by both the advantages and limitations of these methods and inspired by the statistical model in Jiang et al. [8], we propose a two-step procedure to accurately assign functions to reads. In the first step sequencing error is estimated through a mixture model, which models the translated sequence reads at the codon level and detects the possible functions in a metagenomic sample. As a gene could be involved in multiple biological processes, the functional assignment is adjusted in the second step by utilizing an error distribution. The proposed two-step method is comprehensively tested on simulated metagenomic data with diverse complexity of microbial community structure, and is also applied to two real metagenomic datasets. Compared with MEGAN and MG-RAST for functional metagenomic analysis, the proposed approach demonstrates greater accuracy in function identification and abundance quantification. The R package “metaFunction” is available for download at <ext-link ext-link-type="uri" xlink:href="http://cals.arizona.edu/~anling/software.htm">http://cals.arizona.edu/~anling/software.htm</ext-link>
.</p>
</sec>
<sec sec-type="methods" id="s2"><title>Methods</title>
<p>For each sequence dataset we use BLASTX to search for matched reference sequences (i.e., genes) in the NCBI-NR protein database. The genes are then classified into functional role categories as defined by the SEED classification. Based on the sequence reads we need to estimate (1) the sequencing error rate and (2) the functional roles contained in the metagenomic sample and their relative abundances (i.e., proportions). To this end, we set up a mixture model based on the information in the BLASTX results. A binomial model for the sequencing error (estimated by the mixture model) is then used to adjust the function assignment, and the proportion estimate for each function is adjusted accordingly. This adjustment accounts for the fact that a gene or short read could have multiple functional roles. The flowchart for the proposed procedure is shown in <xref ref-type="fig" rid="pone-0106588-g002">Figure 2</xref>
.</p>
<fig id="pone-0106588-g002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g002</object-id>
<label>Figure 2</label>
<caption><title>Flowchart of the proposed method - metaFunction.</title>
</caption>
<graphic xlink:href="pone.0106588.g002"></graphic>
</fig>
<sec id="s2a"><title>Estimate sequencing error</title>
<p>Suppose we have <italic>n</italic>
 sequence reads that are mapped to sequence homologs in the reference database (i.e., NR protein database) and return <italic>K</italic>
 functions (i.e., gene families) in the result of the homology search, e.g., BLASTX output. Let <inline-formula><inline-graphic xlink:href="pone.0106588.e001.jpg"></inline-graphic>
</inline-formula>
denote the number of identical matched codons for read <italic>i</italic>
 under functional role <italic>j</italic>
 and <inline-formula><inline-graphic xlink:href="pone.0106588.e002.jpg"></inline-graphic>
</inline-formula>
 represent the corresponding aligned codon length. Let<inline-formula><inline-graphic xlink:href="pone.0106588.e003.jpg"></inline-graphic>
</inline-formula>
 denote the maximum aligned codon length for read <italic>i</italic>
 across all candidate functions, i.e., <inline-formula><inline-graphic xlink:href="pone.0106588.e004.jpg"></inline-graphic>
</inline-formula>
 then we have <inline-formula><inline-graphic xlink:href="pone.0106588.e005.jpg"></inline-graphic>
</inline-formula>
. If the read <italic>i</italic>
 does not have matched sequences for function <italic>j</italic>
, then<inline-formula><inline-graphic xlink:href="pone.0106588.e006.jpg"></inline-graphic>
</inline-formula>
. We assume that the larger the <inline-formula><inline-graphic xlink:href="pone.0106588.e007.jpg"></inline-graphic>
</inline-formula>
value, the more likely that the read <italic>i</italic>
 performs function <italic>j</italic>
. Let <inline-formula><inline-graphic xlink:href="pone.0106588.e008.jpg"></inline-graphic>
</inline-formula>
 denote the proportion of reads having function <italic>j</italic>
, thus <inline-formula><inline-graphic xlink:href="pone.0106588.e009.jpg"></inline-graphic>
</inline-formula>
. Even if the read <italic>i</italic>
 is from function <italic>j</italic>
, it is also possible that <inline-formula><inline-graphic xlink:href="pone.0106588.e010.jpg"></inline-graphic>
</inline-formula>
is not exactly the same as <inline-formula><inline-graphic xlink:href="pone.0106588.e011.jpg"></inline-graphic>
</inline-formula>
, the maximum aligned length. This may be due to sequencing error, single nucleotide polymorphism (SNP) effects, or different sources (i.e., organisms) of the same gene in the database. Let <italic>p</italic>
 denote the probability of observing a mismatched codon, then <inline-formula><inline-graphic xlink:href="pone.0106588.e012.jpg"></inline-graphic>
</inline-formula>
 is the probability of observing an identical or conserved codon. Therefore the probability that read <italic>i</italic>
 performs function <italic>j</italic>
 with <inline-formula><inline-graphic xlink:href="pone.0106588.e013.jpg"></inline-graphic>
</inline-formula>
 matched codons and <inline-formula><inline-graphic xlink:href="pone.0106588.e014.jpg"></inline-graphic>
</inline-formula>
mismatched codons is <inline-formula><inline-graphic xlink:href="pone.0106588.e015.jpg"></inline-graphic>
</inline-formula>
. Then the probability to observe the read <italic>i</italic>
 in the dataset is <disp-formula id="pone.0106588.e016"><graphic xlink:href="pone.0106588.e016.jpg" position="anchor" orientation="portrait"></graphic>
<label>(1)</label>
</disp-formula>
</p>
<p>Hence the likelihood function of the data is: <disp-formula id="pone.0106588.e017"><graphic xlink:href="pone.0106588.e017.jpg" position="anchor" orientation="portrait"></graphic>
<label>(2)</label>
</disp-formula>
</p>
<p>In this likelihood function, the maximum aligned length <inline-formula><inline-graphic xlink:href="pone.0106588.e018.jpg"></inline-graphic>
</inline-formula>
 and the matches <inline-formula><inline-graphic xlink:href="pone.0106588.e019.jpg"></inline-graphic>
</inline-formula>
can be extracted from the BLASTX output. The parameters <italic>p</italic>
 and <inline-formula><inline-graphic xlink:href="pone.0106588.e020.jpg"></inline-graphic>
</inline-formula>
 are then estimated by the Expectation-Maximization (EM) algorithm [22]. As <italic>p</italic>
 is the probability of observing a mismatched codon, for simplicity we refer to <italic>p</italic>
 as the sequencing error (rate) and to a mismatched codon as a mismatch.</p>
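<p>As an illustration of this estimation step, the following is a minimal Python sketch of EM for the mixture model. Because equations (1) and (2) appear only as images here, the sketch assumes that the probability of read i has the form sum over j of R_j (1-p)^(x_ij) p^(L_i - x_ij), as described in the text above. The function name em_mixture, the starting values and the stopping rule are hypothetical choices for illustration and are not taken from the metaFunction package.</p>

```python
import math

def em_mixture(x, L, n_iter=500, tol=1e-10):
    """EM sketch for the assumed mixture  sum_j R_j (1-p)^x_ij p^(L_i - x_ij).

    x[i][j]: matched codons of read i under function j (0 if no hit)
    L[i]:    maximum aligned codon length of read i
    Returns (R, p): estimated proportions and codon mismatch probability.
    """
    n, K = len(x), len(x[0])
    R = [1.0 / K] * K      # assumed start: uniform proportions
    p = 0.1                # assumed start for the mismatch probability
    total_len = sum(L)
    for _ in range(n_iter):
        # E-step: posterior weight z[i][j] that read i comes from function j
        z = []
        for i in range(n):
            logw = [
                (math.log(R[j]) if R[j] > 0 else -math.inf)
                + x[i][j] * math.log(1 - p)
                + (L[i] - x[i][j]) * math.log(p)
                for j in range(K)
            ]
            top = max(logw)
            w = [math.exp(v - top) for v in logw]   # log-sum-exp for stability
            s = sum(w)
            z.append([v / s for v in w])
        # M-step: closed-form updates of R and p
        new_R = [sum(z[i][j] for i in range(n)) / n for j in range(K)]
        mism = sum(z[i][j] * (L[i] - x[i][j]) for i in range(n) for j in range(K))
        new_p = min(max(mism / total_len, 1e-9), 1 - 1e-9)   # keep p inside (0, 1)
        change = abs(new_p - p) + sum(abs(a - b) for a, b in zip(new_R, R))
        R, p = new_R, new_p
        if tol > change:
            break
    return R, p
```

<p>In this sketch the update for p is the posterior-weighted number of mismatched codons divided by the total aligned length, and each R_j is the average posterior weight of function j across reads.</p>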
</sec>
<sec id="s2b"><title>Multiple-function assignment</title>
<p>A single read may be involved in multiple functional roles. For read <italic>i</italic>
, let its best mismatch (i.e., minimum number of mismatched codons) across all functions be <inline-formula><inline-graphic xlink:href="pone.0106588.e021.jpg"></inline-graphic>
</inline-formula>
. We can then determine the maximum allowable mismatch <inline-formula><inline-graphic xlink:href="pone.0106588.e022.jpg"></inline-graphic>
</inline-formula>
 for a given small probability <italic>ε</italic>
 such that:<disp-formula id="pone.0106588.e023"><graphic xlink:href="pone.0106588.e023.jpg" position="anchor" orientation="portrait"></graphic>
<label>(3)</label>
</disp-formula>
where we assume that the mismatch <inline-formula><inline-graphic xlink:href="pone.0106588.e024.jpg"></inline-graphic>
</inline-formula>
 follows a binomial distribution with parameters <inline-formula><inline-graphic xlink:href="pone.0106588.e025.jpg"></inline-graphic>
</inline-formula>
. Then read <italic>i</italic>
 can be assigned to all the functions with mismatch ≤<inline-formula><inline-graphic xlink:href="pone.0106588.e026.jpg"></inline-graphic>
</inline-formula>
. The relative abundance <inline-formula><inline-graphic xlink:href="pone.0106588.e027.jpg"></inline-graphic>
</inline-formula>
will be updated by this new multiple function role assignment, i.e., the updated one becomes:<inline-formula><inline-graphic xlink:href="pone.0106588.e028.jpg"></inline-graphic>
</inline-formula>
, where <inline-formula><inline-graphic xlink:href="pone.0106588.e029.jpg"></inline-graphic>
</inline-formula>
is the number of short reads assigned to the function <italic>j</italic>
 after the adjustment, and <italic>n</italic>
 is the total number of short reads in the dataset. Thus we have <inline-formula><inline-graphic xlink:href="pone.0106588.e030.jpg"></inline-graphic>
</inline-formula>
. The procedure for the multiple-function adjustment can be summarized as follows:</p>
<p>Based on the estimated sequencing error obtained in step 1 and a pre-specified small probability <italic>ε</italic>
:</p>
<list list-type="order"><list-item><p>for read <italic>i</italic>
 calculate its maximum allowable mismatch <inline-formula><inline-graphic xlink:href="pone.0106588.e031.jpg"></inline-graphic>
</inline-formula>
 using eq. (3)</p>
</list-item>
<list-item><p>assign all functions with corresponding mismatch ≤<inline-formula><inline-graphic xlink:href="pone.0106588.e032.jpg"></inline-graphic>
</inline-formula>
 to read <italic>i</italic>
</p>
</list-item>
<list-item><p>repeat steps 1) and 2) for all reads</p>
</list-item>
<list-item><p>calculate the new proportion <inline-formula><inline-graphic xlink:href="pone.0106588.e033.jpg"></inline-graphic>
</inline-formula>
 for each function based on these new assignments.</p>
</list-item>
</list>
<p><xref ref-type="fig" rid="pone-0106588-g003">Figure 3</xref>
 illustrates the calculation for multiple function assignment based on a binomial distribution. In this illustration the maximum length of the aligned codons is 32 and the sequencing error at the codon level is 0.15. If the best mismatch is 0 and the probability <italic>ε</italic>
 = 0.05, then the maximum allowable mismatch is calculated as 1. This means that the functions with 32 or 31 ( = 32-1) matched codons in the BLASTX output are all plausible, and the read is therefore assigned to all of these functions. The small probability <italic>ε</italic>
 in <xref ref-type="disp-formula" rid="pone.0106588.e023">equation (3)</xref>
 is suggested to be one third of the sequencing error at the codon level (estimated in the first step), or simply the sequencing error at the nucleotide level if it is known or given. More details about the selection of <italic>ε</italic>
 can be found in the simulation studies below.</p>
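<p>The multiple-function assignment can be sketched in Python as follows. Equation (3) appears only as an image here, so the rule coded below is an assumption: the maximum allowable mismatch is taken to be the largest m, no smaller than the best mismatch, for which the binomial probability of observing between the best mismatch and m mismatched codons does not exceed ε. This reading was chosen because it reproduces the worked example in the text; the function names and the role labels are hypothetical.</p>

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k mismatches out of n codons."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def max_allowable_mismatch(best, L, p, eps):
    """Assumed reading of eq. (3): largest m (at least `best`) such that
    P(best to m mismatches) stays within eps, for M ~ Binomial(L, p).
    Falls back to `best` when even P(M = best) exceeds eps."""
    cum = 0.0
    m = best
    for k in range(best, L + 1):
        cum += binom_pmf(k, L, p)
        if cum > eps:
            break
        m = k
    return m

def assign_functions(mismatches, L, p, eps):
    """mismatches: dict mapping function name to mismatched codons for one read.
    Returns every function whose mismatch is within the allowable maximum."""
    best = min(mismatches.values())
    m_max = max_allowable_mismatch(best, L, p, eps)
    return sorted(f for f, mm in mismatches.items() if m_max >= mm)

def updated_proportions(assignments, n_reads):
    """assignments: one list of assigned functions per read.
    Returns n_j / n for each function j, as described in the text."""
    counts = {}
    for funcs in assignments:
        for f in funcs:
            counts[f] = counts.get(f, 0) + 1
    return {f: c / n_reads for f, c in counts.items()}
```

<p>With the settings of the example above (maximum aligned length 32, codon-level error 0.15, ε = 0.05, best mismatch 0), max_allowable_mismatch returns 1. Because one read can count toward several functions, the updated proportions may sum to more than one.</p>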
<fig id="pone-0106588-g003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g003</object-id>
<label>Figure 3</label>
<caption><title>Illustration of calculation of multiple function assignment.</title>
<p>In this plot <italic>ε</italic>
 = 0.05 and the binomial distribution has <italic>p</italic>
 = 0.15 and <italic>L<sub>i</sub>
 = </italic>
32.</p>
</caption>
<graphic xlink:href="pone.0106588.g003"></graphic>
</fig>
</sec>
<sec id="s2c"><title>Construct statistical inferences</title>
<p>None of the existing methods for functional metagenomic analysis assesses the uncertainty of the proportion of reads assigned to each function. We propose to use the bootstrap method [23] to construct confidence intervals for the estimates. We first draw a bootstrap sample by resampling the original sequence reads with replacement; the relative abundances are then estimated for the bootstrap sample using the two-step procedure described above. We repeat this resampling a large number of times, e.g., 1000 times. The confidence intervals are then constructed from these bootstrap estimates. Since we construct the confidence intervals for the abundances of the <italic>K</italic>
 functions, <italic>R<sub>j</sub>
</italic>
 (<italic>j</italic>
 = 1,…, <italic>K</italic>
) simultaneously, a multiple-comparison correction, e.g., the Bonferroni method [24], is applied to guarantee a pre-specified family-wise confidence level.</p>
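<p>The bootstrap procedure can be sketched in Python as below. The text does not state which type of bootstrap interval is used, so percentile intervals are assumed here; the function name bootstrap_ci and the estimate callback, which stands in for the two-step procedure, are hypothetical.</p>

```python
import random

def bootstrap_ci(reads, estimate, K, n_boot=1000, family_alpha=0.05, seed=0):
    """Percentile bootstrap intervals (an assumption) for K proportions,
    with a Bonferroni correction for the family-wise level.

    reads:    list of per-read records
    estimate: callable mapping a list of reads to a list of K proportions
    """
    rng = random.Random(seed)
    n = len(reads)
    alpha = family_alpha / K          # Bonferroni: level used for each interval
    samples = []
    for _ in range(n_boot):
        # resample the reads with replacement and re-estimate the proportions
        resample = [reads[rng.randrange(n)] for _ in range(n)]
        samples.append(estimate(resample))
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int(round((1 - alpha / 2) * (n_boot - 1)))
    intervals = []
    for j in range(K):
        vals = sorted(s[j] for s in samples)
        intervals.append((vals[lo_idx], vals[hi_idx]))
    return intervals
```

<p>Dividing the family-wise level by K makes each individual interval wider, which is the price of simultaneous coverage for all K functions.</p>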
</sec>
<sec id="s2d"><title>Simulation studies</title>
<sec id="s2d1"><title>Experimental data</title>
<p>Due to the complexity of metagenomic data, simulation studies with verifiable structure are crucial to benchmark the proposed approach and to conduct comparisons with other existing methods. So far there is no literature about how to set up a simulation study for functional metagenomics. We propose to use the SEED database (<ext-link ext-link-type="uri" xlink:href="http://pseed.theseed.org">http://pseed.theseed.org</ext-link>
) and conduct six different simulation studies. Basic information of these six simulation settings is listed in <xref ref-type="table" rid="pone-0106588-t001">Table 1</xref>
. Similar to the studies in MetaSim [25], which contain a small number of genomes in each setting, we simulate a small number of functions in each study.</p>
<table-wrap id="pone-0106588-t001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.t001</object-id>
<label>Table 1</label>
<caption><title>Basic information of six simulation studies.</title>
</caption>
<alternatives><graphic id="pone-0106588-t001-1" xlink:href="pone.0106588.t001"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">Study</td>
<td align="left" rowspan="1" colspan="1">Characteristic of the 10 primary functional roles</td>
<td align="left" rowspan="1" colspan="1">Sampling rate from SEED database</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">Different</td>
<td align="left" rowspan="1" colspan="1">fixed 20%</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">Different</td>
<td align="left" rowspan="1" colspan="1">20∼40%</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">Closely related</td>
<td align="left" rowspan="1" colspan="1">fixed 20%</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">Closely related</td>
<td align="left" rowspan="1" colspan="1">20∼40%</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">Same as study 1 & 2</td>
<td align="left" rowspan="1" colspan="1">Large sample size</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">Same as study 3 & 4</td>
<td align="left" rowspan="1" colspan="1">Large sample size</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>Study 1 contains 10 functional roles that are far away from each other in the SEED tree. For each functional role, 20% of the sequences (i.e., FIGfams, very long sequences) in the SEED database are chosen, i.e., the sampling rate is 20%. A short segment of 100 bp is then randomly cut from each selected long sequence, and 2% sequencing error is added to it. Sequencing errors can arise from substitutions, deletions, and insertions; for the purpose of method illustration we consider only substitution errors. It is well known that some genes are involved in multiple functions in a microbial community. This is also reflected in the gene sequences in the SEED database, i.e., some long sequences are labeled with multiple functions. As expected, a few additional function names are obtained for the short reads in the 10 pre-selected groups. We call these secondary functions, and the 10 pre-selected functions primary functions (see Table S1 in <xref ref-type="supplementary-material" rid="pone.0106588.s002">File S1</xref>
). The number of short reads for each function is also listed in Table S1 in <xref ref-type="supplementary-material" rid="pone.0106588.s002">File S1</xref>
. Both types of functions are treated as true functions, since the functions of either type are true for the generated short reads.</p>
<p>Study 2 contains the same 10 primary functions as study 1 but with varying sampling rates (see Table S1 in <xref ref-type="supplementary-material" rid="pone.0106588.s002">File S1</xref>
). The number of short sequence reads generated for each function is based on the total number of long sequences in the function group in the SEED database. Generally, the sampling rate varies between 20% and 40%. In studies 3 and 4 we use another set of 10 functions (see Table S2 in <xref ref-type="supplementary-material" rid="pone.0106588.s002">File S1</xref>
). Unlike in studies 1 and 2, the 10 function groups here are very closely related (i.e., some functional roles belong to the same subsystems). Study 5 contains the same 10 primary function groups as studies 1 and 2 but the sampling rate is much larger, about 4∼5 times that of the first two studies; similarly, study 6 contains the same 10 primary function groups as studies 3 and 4 but the sampling rate is about 4∼5 times that of these two studies (see Tables S1 and S2 in <xref ref-type="supplementary-material" rid="pone.0106588.s002">File S1</xref>
). The coverage, i.e., the ratio of the number of simulated base pairs to the total number of base pairs for the selected functions in the SEED database, varies from 2% to 9% in these six studies.</p>
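<p>The read-generation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 2% error is applied as an independent per-base substitution probability, sequences are plain nucleotide strings, and the function names are hypothetical; the actual simulation code is not given in the text.</p>

```python
import random

BASES = "ACGT"

def simulate_read(sequence, rng, read_len=100, error_rate=0.02):
    """Cut a random segment of read_len bases and add substitution errors.
    Assumes each base is substituted independently with probability error_rate."""
    read_len = min(read_len, len(sequence))
    start = rng.randrange(len(sequence) - read_len + 1)
    fragment = sequence[start:start + read_len]
    out = []
    for base in fragment:
        if error_rate > rng.random():
            # substitute with one of the other three bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_function_reads(sequences, rng, sampling_rate=0.2,
                            read_len=100, error_rate=0.02):
    """Sample a fraction of the long sequences of one function group
    and generate one short read from each sampled sequence."""
    k = max(1, round(sampling_rate * len(sequences)))
    chosen = rng.sample(sequences, k)
    return [simulate_read(s, rng, read_len, error_rate) for s in chosen]
```

<p>Passing a seeded random.Random instance as rng makes a simulated data set reproducible, which is convenient when several methods are compared on the same reads.</p>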
</sec>
<sec id="s2d2"><title>Simulation Results</title>
<p>Three methods, MEGAN (best hit), MG-RAST (flat cutoff) and the proposed method metaFunction, are compared through these six simulation studies. The result for the first simulation study is shown in <xref ref-type="fig" rid="pone-0106588-g004">Figure 4</xref>
, which plots the estimated (i.e., predicted) abundance of each function against its true (i.e., expected) abundance. If all the functions are detected and their abundances are correctly estimated, then the Pearson correlation between the expected and predicted abundances is one. The plot shows that the proposed approach has the largest correlation. <xref ref-type="table" rid="pone-0106588-t002">Table 2</xref>
 summarizes the correlations in all six studies for these three methods. The new method outperforms the other two methods in all studies in terms of correlation between the true and estimated abundances. While MEGAN and metaFunction perform better on the distant function studies (studies 1, 2 and 5) than on the closely related function studies (studies 3, 4 and 6), MG-RAST seems to work better on the closely related functions. This is because in the distant function studies MG-RAST detects a false function, “decarboxylase”, with a very large proportion, which greatly affects the correlation calculation.</p>
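<p>The correlation metric can be sketched in Python as below. The sketch assumes that a function reported by a method but absent from the truth enters the calculation with an expected abundance of zero, which is consistent with the remark about the false function above; the function names and the toy abundances are hypothetical.</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def abundance_correlation(expected, predicted):
    """expected, predicted: dicts mapping function name to relative abundance.
    Functions missing from one side are counted with abundance 0 (assumption)."""
    names = sorted(set(expected) | set(predicted))
    xs = [expected.get(f, 0.0) for f in names]
    ys = [predicted.get(f, 0.0) for f in names]
    return pearson(xs, ys)

# Toy illustration: a large false function pulls the correlation down.
truth = {"f1": 0.5, "f2": 0.3, "f3": 0.2}
with_false = {"f1": 0.3, "f2": 0.18, "f3": 0.12, "false_role": 0.4}
r_perfect = abundance_correlation(truth, truth)
r_false = abundance_correlation(truth, with_false)
```

<p>In this toy example the predicted abundances of the three true functions keep their relative order, yet the single false function with a large proportion is enough to lower the correlation substantially.</p>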
<fig id="pone-0106588-g004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g004</object-id>
<label>Figure 4</label>
<caption><title>Scatter plot of the predicted vs. expected (true) relative abundance of the functions in Simulation 1.</title>
</caption>
<graphic xlink:href="pone.0106588.g004"></graphic>
</fig>
<table-wrap id="pone-0106588-t002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.t002</object-id>
<label>Table 2</label>
<caption><title>Summary of the correlation values in all six studies by three methods.</title>
</caption>
<alternatives><graphic id="pone-0106588-t002-2" xlink:href="pone.0106588.t002"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">Study 1</td>
<td align="left" rowspan="1" colspan="1">Study 2</td>
<td align="left" rowspan="1" colspan="1">Study 3</td>
<td align="left" rowspan="1" colspan="1">Study 4</td>
<td align="left" rowspan="1" colspan="1">Study 5</td>
<td align="left" rowspan="1" colspan="1">Study 6</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">MEGAN</td>
<td align="left" rowspan="1" colspan="1">0.986</td>
<td align="left" rowspan="1" colspan="1">0.968</td>
<td align="left" rowspan="1" colspan="1">0.852</td>
<td align="left" rowspan="1" colspan="1">0.839</td>
<td align="left" rowspan="1" colspan="1">0.973</td>
<td align="left" rowspan="1" colspan="1">0.917</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">MG-RAST</td>
<td align="left" rowspan="1" colspan="1">0.711</td>
<td align="left" rowspan="1" colspan="1">0.696</td>
<td align="left" rowspan="1" colspan="1">0.880</td>
<td align="left" rowspan="1" colspan="1">0.857</td>
<td align="left" rowspan="1" colspan="1">0.750</td>
<td align="left" rowspan="1" colspan="1">0.895</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">metaFunction</td>
<td align="left" rowspan="1" colspan="1">0.996</td>
<td align="left" rowspan="1" colspan="1">0.993</td>
<td align="left" rowspan="1" colspan="1">0.953</td>
<td align="left" rowspan="1" colspan="1">0.943</td>
<td align="left" rowspan="1" colspan="1">0.997</td>
<td align="left" rowspan="1" colspan="1">0.982</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt101"><label></label>
<p>The correlation is calculated between the expected (i.e., true) and estimated abundance for the simulated functions.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We also evaluate the performance of the three methods on the same simulations using another metric. A common measure of error is the root mean square of relative error [7,26]. In this definition each feature group is given the same weight in the error calculation, regardless of the abundance of features in each group. In functional analysis of metagenomes, a function group estimated with a tiny number of counts is much less likely to exist in the sample than a group with a large number of read counts. We therefore modify the error measure to the weighted root mean square of relative error (WRRMSE), i.e.,</p>
<p><inline-formula><inline-graphic xlink:href="pone.0106588.e034.jpg"></inline-graphic>
</inline-formula>
, where the weight <inline-formula><inline-graphic xlink:href="pone.0106588.e035.jpg"></inline-graphic>
</inline-formula>
,</p>
<p><inline-formula><inline-graphic xlink:href="pone.0106588.e036.jpg"></inline-graphic>
</inline-formula>
 is the estimated number of reads for function <italic>j</italic>
, <inline-formula><inline-graphic xlink:href="pone.0106588.e037.jpg"></inline-graphic>
</inline-formula>
 is the estimated relative abundance (i.e., estimated proportion) and <inline-formula><inline-graphic xlink:href="pone.0106588.e038.jpg"></inline-graphic>
</inline-formula>
 is the true relative abundance, and <italic>m</italic>
 is the number of true function groups. The WRRMSE results for six studies are shown in <xref ref-type="fig" rid="pone-0106588-g005">Figure 5</xref>
. In each subplot the x-axis is the SEED subsystem level. Compared with MEGAN and MG-RAST, the proposed method has the lowest error at every level of the subsystems and in all simulation studies. The decrease in error at the Sub 2 level in <xref ref-type="fig" rid="pone-0106588-g005">Figure 5</xref>
 is due to unnamed subsystems in the SEED tree. For example, if a read is assigned to a level-3 subsystem whose parent node has no name (i.e., NULL), the assignment to this unknown level-2 subsystem is excluded from the error calculation. Removing the NULL group, which may contain some wrong assignments, therefore lowers the error at this level.</p>
<fig id="pone-0106588-g005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g005</object-id>
<label>Figure 5</label>
<caption><title>Plot of WRRMSE values for three methods and in six simulation studies.</title>
<p>Weighted Root of Mean Square Relative Error (WRRMSE) is calculated between the true function/subsystem and the estimated function/subsystem by each method (MEGAN, MG-RAST, and metaFunction).</p>
</caption>
<graphic xlink:href="pone.0106588.g005"></graphic>
</fig>
<p>The accuracy of relative abundance estimation plays an important role in metagenomic analysis. The accuracy of the assignment of short reads is also of great interest to biologists in functional metagenomics, as they need to know which reads perform which functions. As MG-RAST does not provide the assignment information, we compare the performance of MEGAN and the proposed method metaFunction regarding the assignment details. In each of the six simulation studies we calculate the proportions of correctly assigned (CA), wrongly assigned (WA), and not assigned (NA, i.e., not aligned to the reference database) reads across all functions. The assignment details are also examined at other levels of the subsystem. The results of simulation study 1 are displayed in <xref ref-type="table" rid="pone-0106588-t003">Table 3</xref>
. At every level of the subsystems (including the function level) the proportions of NA using metaFunction are lower than those from MEGAN. The WA proportions for metaFunction are slightly higher than those for MEGAN but they are comparable (all below 1%). The new approach results in a much higher CA rate than MEGAN (about 91% vs 77%). Consistent conclusions are obtained for the other simulation studies (data not shown).</p>
<table-wrap id="pone-0106588-t003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.t003</object-id>
<label>Table 3</label>
<caption><title>Proportion of correctly assigned (CA), wrongly assigned (WA), and not assigned (NA) simulated reads by MEGAN and metaFunction.</title>
</caption>
<alternatives><graphic id="pone-0106588-t003-3" xlink:href="pone.0106588.t003"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1"></td>
<td colspan="3" align="left" rowspan="1">MEGAN</td>
<td colspan="3" align="left" rowspan="1">metaFunction</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">CA (%)</td>
<td align="left" rowspan="1" colspan="1">WA (%)</td>
<td align="left" rowspan="1" colspan="1">NA (%)</td>
<td align="left" rowspan="1" colspan="1">CA(%)</td>
<td align="left" rowspan="1" colspan="1">WA (%)</td>
<td align="left" rowspan="1" colspan="1">NA (%)</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">Function</td>
<td align="left" rowspan="1" colspan="1">77.45</td>
<td align="left" rowspan="1" colspan="1">0.22</td>
<td align="left" rowspan="1" colspan="1">22.55</td>
<td align="left" rowspan="1" colspan="1">91.06</td>
<td align="left" rowspan="1" colspan="1">0.39</td>
<td align="left" rowspan="1" colspan="1">8.94</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Subsystem 3</td>
<td align="left" rowspan="1" colspan="1">77.22</td>
<td align="left" rowspan="1" colspan="1">0.22</td>
<td align="left" rowspan="1" colspan="1">22.78</td>
<td align="left" rowspan="1" colspan="1">91.05</td>
<td align="left" rowspan="1" colspan="1">0.38</td>
<td align="left" rowspan="1" colspan="1">8.95</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Subsystem 2</td>
<td align="left" rowspan="1" colspan="1">76.66</td>
<td align="left" rowspan="1" colspan="1">0.21</td>
<td align="left" rowspan="1" colspan="1">23.34</td>
<td align="left" rowspan="1" colspan="1">91.11</td>
<td align="left" rowspan="1" colspan="1">0.37</td>
<td align="left" rowspan="1" colspan="1">8.89</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Subsystem 1</td>
<td align="left" rowspan="1" colspan="1">76.66</td>
<td align="left" rowspan="1" colspan="1">0.21</td>
<td align="left" rowspan="1" colspan="1">23.34</td>
<td align="left" rowspan="1" colspan="1">91.11</td>
<td align="left" rowspan="1" colspan="1">0.37</td>
<td align="left" rowspan="1" colspan="1">8.89</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt102"><label></label>
<p>This result is for different levels of the SEED tree in the first simulation study.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s2d3"><title>Selection of ε</title>
<p>The above results are based on <inline-formula><inline-graphic xlink:href="pone.0106588.e039.jpg"></inline-graphic>
</inline-formula>
 in eq. (3). We conduct another study to investigate the effect of the choice of the small probability <italic>ε</italic>
 on the final result in terms of the error metric defined above. Let <italic>ε</italic>
 take various values of 0.01, 0.05, or 0.1 for the multiple role assignment. Within each of the above six simulated experiments the WRRMSE values are very close for these three different <italic>ε</italic>
 values. That is, the absolute difference on WRRMSE is less than 0.0001 between the two situations of <italic>ε</italic>
 = 0.01 and <italic>ε</italic>
 = 0.05; and less than 0.008 between <italic>ε</italic>
 = 0.01 and <italic>ε</italic>
 = 0.10. In terms of relative difference on WRRMSE, the values are (0∼2%) for different <italic>ε.</italic>
 Therefore, the final result is not sensitive to the selection of <italic>ε</italic>
. In the above six simulated experiments 2% error is added to each short read, so any value between 0.01 and 0.05 is plausible for <italic>ε.</italic>
</p>
</sec>
</sec>
</sec>
<sec id="s3"><title>Real Data Analysis</title>
<p>Real metagenomic data from an environmental study and a human health study are analyzed using the proposed method - metaFunction.</p>
<sec id="s3a"><title>Environmental study</title>
<p>Metagenomic functions were compared between Lake Erie (North America) and Lake Taihu (China) [27]. Toxic <italic>cyanobacteria</italic>
 blooms appear to be a global problem as toxins produced by bloom-associated <italic>cyanobacteria</italic>
 can have drastic impacts on the ecosystem and surrounding communities; in addition, the produced bloom biomass can disrupt aquatic food webs and act as a driver for hypoxia. Freshwater samples were collected from different lakes to examine the bloom-associated microbial communities. We select two lakes, Lake Erie and Lake Taihu, as they represent different continents, to examine the gene contents. After quality checking, a total of 750 thousand reads with an average length of 425 bp are aligned to the NCBI non-redundant database. The proposed method is then applied to the alignment output. The original study used both MEGAN and MG-RAST for functional annotation and reported that the two results are highly consistent. We compare our result to the MG-RAST result in the original paper, which was downloaded from the MG-RAST online server (<ext-link ext-link-type="uri" xlink:href="http://metagenomics.anl.gov/">http://metagenomics.anl.gov/</ext-link>
) under the identification numbers 4467029.3 (Erie), 4467058.3 (Taihu).</p>
<p>The functionality profiles of microbial communities in these two lakes by metaFunction and MG-RAST are summarized at the level 1 of subsystem (<xref ref-type="fig" rid="pone-0106588-g006">Figure 6</xref>
). Generally, the results from these two approaches are consistent: the subsystems found with large proportions by one method are also detected in large amounts by the other. However, there are also some discrepancies between the two results. The subsystem “Miscellaneous” is found to be dominant by MG-RAST in both lakes but is not abundant according to the new method; “Virulence, Disease and Defense”, “Virulence”, “Membrane Transport”, and “Cell Wall and Capsule” are observed to be more abundant by the new approach than by MG-RAST.</p>
<fig id="pone-0106588-g006" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g006</object-id>
<label>Figure 6</label>
<caption><title>Proportions of the detected subsystems (level 1) by MG-RAST and metaFunction for the lake data.</title>
<p>The top 27 subsystems with proportion >0.005 in at least one of the samples are listed. The “error” bars represent the 95% confidence intervals obtained by the bootstrap method. Note: only the proposed approach provides confidence intervals for the estimated proportions.</p>
</caption>
<graphic xlink:href="pone.0106588.g006"></graphic>
</fig>
<p>When comparing the results between the two lakes, we found that subsystems that are abundant in one lake according to MG-RAST are often also abundant in the same lake according to the new method. For instance, “<italic>Amino Acid and Derivatives</italic>
”, “<italic>Carbohydrates</italic>
”, “<italic>Nucleosides and Nucleotides</italic>
”, and “<italic>Membrane Transport</italic>
” are found more abundant in Lake Taihu than in Lake Erie. Meanwhile “<italic>Cell Division and Cell Cycle</italic>
”, “<italic>Regulation and Cell signaling</italic>
”, “<italic>Cofactors, Vitamins, Prosthetic Groups, Pigments</italic>
” are lower in Lake Taihu. A major difference between the two approaches is that the new method provides confidence intervals for the proportion estimates, which are displayed as the small bars in <xref ref-type="fig" rid="pone-0106588-g006">Figure 6</xref>
. Thus the new method provides more information for group comparisons. A comparison between the two lakes at a lower level of subsystems, level 3, is shown in <xref ref-type="supplementary-material" rid="pone.0106588.s001">Figure S1</xref>
. Not surprisingly, the results from the two approaches in <xref ref-type="supplementary-material" rid="pone.0106588.s001">Figure S1</xref>
 are more disparate than at the higher level of subsystems in <xref ref-type="fig" rid="pone-0106588-g006">Figure 6</xref>
.</p>
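<p>As an illustration of how a bootstrap confidence interval for a subsystem proportion can be obtained, the following minimal sketch uses a percentile bootstrap over per-read assignments. It is an assumption-laden example, not the metaFunction implementation: the function name bootstrap_proportion_ci, the resampling unit (one label per read), the percentile interval, and the toy data are all hypothetical choices made for this example.</p>

```python
import random


def bootstrap_proportion_ci(assignments, label, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the share of reads assigned to `label`.

    `assignments` is a list with one subsystem label per read (hypothetical
    input format; the paper's own resampling scheme may differ).
    """
    rng = random.Random(seed)
    n = len(assignments)
    stats = []
    for _ in range(n_boot):
        # Resample reads with replacement and recompute the proportion.
        resample = rng.choices(assignments, k=n)
        stats.append(resample.count(label) / n)
    stats.sort()
    alpha = 1.0 - level
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data: 300 of 1000 reads assigned to one subsystem (made-up numbers).
toy_reads = ["Carbohydrates"] * 300 + ["Other"] * 700
print(bootstrap_proportion_ci(toy_reads, "Carbohydrates"))
```

<p>Two samples can then be compared by checking whether their intervals for the same subsystem overlap, which is the kind of group comparison the small bars in Figure 6 support.</p>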
</sec>
<sec id="s3b"><title>Human Health study</title>
<p>Human oral microbial samples were studied in relation to dental caries using 454 pyrosequencing [28]. Two healthy samples and two cavity samples were selected for our analysis; of the cavity samples, one was at an intermediate stage and the other at an advanced stage of caries development. After quality checking, 0.5 Gbp of sequence with an average read length of 425 bp was searched with BLASTX against the NCBI-NR protein database to find matching reference sequences (i.e., genes). Reads were then classified into functional role categories, as defined by the SEED structure, using the proposed method. The functional profiles of all four samples at subsystem level 3 are shown in <xref ref-type="fig" rid="pone-0106588-g007">Figure 7</xref>
.</p>
<fig id="pone-0106588-g007" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0106588.g007</object-id>
<label>Figure 7</label>
<caption><title>Proportions of the detected subsystems (level 3) for the oral data.</title>
<p>The top 78 subsystems with proportion >0.005 in at least one of the samples are listed. The “error” bars represent the 95% confidence intervals obtained by the bootstrap method.</p>
</caption>
<graphic xlink:href="pone.0106588.g007"></graphic>
</fig>
<p>In this plot the abundance of “<italic>Conjugative transposon Bacteroidales</italic>
” is much higher in the cavity samples than in the healthy samples, which agrees with other literature [29]; “<italic>Fatty Acid Biosynthesis FASI</italic>
” also shows a higher value in the diseased samples than in the healthy samples, which is consistent with the finding in [30]; the abundance of “<italic>Flagellum</italic>
” in the cavity samples is also reported by Seshadri et al. [31]; high values of “<italic>Glutamine, Glutamate, Aspartate and Asparagine Biosynthesis</italic>
” and of “<italic>Methionine degradation</italic>
” in the oral cavity samples are also mentioned in other publications [32,33]; the abundance of “<italic>Universal GTPases</italic>
” is higher in the cavity samples than in the healthy samples, which is also found in other literature [34]. In conclusion, the results from the new method are consistent with the previous literature.</p>
</sec>
</sec>
<sec id="s4"><title>Discussion</title>
<p>One of the main challenges in metagenomic studies is to accurately identify all possible functional roles present in an environmental sample and to precisely estimate their abundances. Because of the complexity of metagenomes and the huge volume of short sequencing reads produced by next-generation sequencing technologies, the need for efficient statistical tools to meet this challenge is increasing. We proposed a two-step procedure for functional analysis of a metagenome, a mixture model coupled with an adjustment for multiple role assignment, which accurately assigns reads to related functional roles by utilizing the SEED classification. Although this research was initiated for the SEED classification, the proposed method can be generalized to any type of functional annotation system.</p>
<p>In comprehensive simulation studies comparing it with MEGAN and MG-RAST, our procedure metaFunction is more effective in assigning reads to functional roles and, consequently, to subsystems. In simulation studies 1 and 2, MEGAN cannot assign any read to one of the true functional roles (<xref ref-type="fig" rid="pone-0106588-g004">Figure 4</xref>
), while in simulation studies 3 and 4, MG-RAST cannot assign any read to one of the true functional roles (plot not shown). This failure never occurred with our approach. In addition, the proposed method correctly assigns a higher percentage of reads to functional roles than MEGAN does. MEGAN utilizes the best bit-score for assignment. If a read has the same best score for multiple functions in the BLAST output, only the first function (alphabetically) is chosen for the assignment. In our method, all functions sharing the best score are assigned to the read. Unlike other existing methods, the proposed method provides confidence intervals for the estimated proportions by using the bootstrap.</p>
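<p>The difference in tie handling described above can be sketched as follows. This is a hypothetical illustration only: the hit format (read identifier, function name, bit-score), the function names best_hits, assign_first_alphabetical and assign_all_tied, and the toy scores are assumptions, and the sketch covers only the selection of candidate functions, not the mixture model or the adjustment step of metaFunction.</p>

```python
from collections import defaultdict


def best_hits(hits):
    """Return, for each read, the sorted functions that share the top bit-score."""
    by_read = defaultdict(list)
    for read_id, function, bit_score in hits:
        by_read[read_id].append((function, bit_score))
    best = {}
    for read_id, candidates in by_read.items():
        top = max(score for _, score in candidates)
        best[read_id] = sorted({f for f, s in candidates if s == top})
    return best


def assign_first_alphabetical(hits):
    """Keep only the alphabetically first function among tied best hits."""
    return {read: funcs[0] for read, funcs in best_hits(hits).items()}


def assign_all_tied(hits):
    """Keep every function that ties for the best bit-score."""
    return best_hits(hits)


# Toy BLAST-like hits (made-up scores): read r1 has two tied best functions.
toy_hits = [
    ("r1", "Universal GTPases", 120.0),
    ("r1", "Flagellum", 120.0),
    ("r1", "Methionine degradation", 95.0),
    ("r2", "Methionine degradation", 80.0),
]
print(assign_first_alphabetical(toy_hits))
print(assign_all_tied(toy_hits))
```

<p>With the first rule, a functional role that only ever ties with an alphabetically earlier role would receive no reads at all, which is one way a true role can be missed entirely.</p>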
<p>We also applied the proposed method to two real metagenomic datasets; our results are generally consistent with the findings of previous reports but provide more detailed information. Future work is to integrate taxonomic and functional analysis, in other words, to consider the two problems simultaneously, so that power can be improved for both the taxonomic and the functional profiling of a metagenomic sample.</p>
</sec>
<sec sec-type="supplementary-material" id="s5"><title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0106588.s001"><label>Figure S1</label>
<caption><p><bold>Proportions of the detected subsystems (level 3) by MG-RAST and metaFunction for the lake data.</bold>
 The top 66 subsystems with proportion >0.005 in at least one of the samples are listed. The “error” bars represent the 95% confidence intervals obtained by the bootstrap method. Note: only the proposed approach provides confidence intervals for the estimated proportions.</p>
<p>(TIF)</p>
</caption>
<media xlink:href="pone.0106588.s001.tif"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0106588.s002"><label>File S1</label>
<caption><p>Table S1. Number of short reads generated from 10 primary functional roles for simulation studies 1, 2, and 5. The function names in italic are secondary functions.</p>
<p>Table S2. Number of short reads generated from 10 primary functional roles for simulation studies 3, 4, and 6. The function names in italic are secondary functions.</p>
<p>(DOCX)</p>
</caption>
<media xlink:href="pone.0106588.s002.docx"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><ack><p>The authors would like to thank Dr. Zhenqiang Lu for computational assistance.</p>
</ack>
<ref-list><title>References</title>
<ref id="pone.0106588-Huson1"><label>1</label>
<mixed-citation publication-type="journal"><name><surname>Huson</surname>
<given-names>DH</given-names>
</name>
, <name><surname>Auch</surname>
<given-names>AF</given-names>
</name>
, <name><surname>Qi</surname>
<given-names>J</given-names>
</name>
, <name><surname>Schuster</surname>
<given-names>SC</given-names>
</name>
 (<year>2007</year>
) <article-title>MEGAN analysis of metagenomic data</article-title>
. <source>Genome Res</source>
<volume>17</volume>
: <fpage>377</fpage>
–<lpage>386</lpage>
.<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Rosen1"><label>2</label>
<mixed-citation publication-type="journal"><name><surname>Rosen</surname>
<given-names>GL</given-names>
</name>
, <name><surname>Sokhansanj</surname>
<given-names>BA</given-names>
</name>
, <name><surname>Polikar</surname>
<given-names>R</given-names>
</name>
, <name><surname>Bruns</surname>
<given-names>MA</given-names>
</name>
, <name><surname>Russell</surname>
<given-names>J</given-names>
</name>
, <etal>et al</etal>
 (<year>2009</year>
) <article-title>Signal Processing for Metagenomics: Extracting Information from the Soup</article-title>
. <source>Current Genomics</source>
<volume>10</volume>
: <fpage>493</fpage>
–<lpage>510</lpage>
.<pub-id pub-id-type="pmid">20436876</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Mardis1"><label>3</label>
<mixed-citation publication-type="journal"><name><surname>Mardis</surname>
<given-names>ER</given-names>
</name>
 (<year>2008</year>
) <article-title>Next-generation DNA sequencing methods</article-title>
. <source>Annual Review of Genomics and Human Genetics</source>
<volume>9</volume>
: <fpage>387</fpage>
–<lpage>402</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Clemente1"><label>4</label>
<mixed-citation publication-type="journal">Clemente JC, Jansson J, Valiente G (2011) Flexible taxonomic assignment of ambiguous sequencing reads. BMC Bioinformatics <volume>12</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Meinicke1"><label>5</label>
<mixed-citation publication-type="journal"><name><surname>Meinicke</surname>
<given-names>P</given-names>
</name>
, <name><surname>Asshauer</surname>
<given-names>KP</given-names>
</name>
, <name><surname>Lingner</surname>
<given-names>T</given-names>
</name>
 (<year>2011</year>
) <article-title>Mixture models for analysis of the taxonomic composition of metagenomes</article-title>
. <source>Bioinformatics</source>
<volume>27</volume>
: <fpage>1618</fpage>
–<lpage>1624</lpage>
.<pub-id pub-id-type="pmid">21546400</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Huson2"><label>6</label>
<mixed-citation publication-type="journal"><name><surname>Huson</surname>
<given-names>DH</given-names>
</name>
, <name><surname>Mitra</surname>
<given-names>S</given-names>
</name>
, <name><surname>Ruscheweyh</surname>
<given-names>HJ</given-names>
</name>
, <name><surname>Weber</surname>
<given-names>N</given-names>
</name>
, <name><surname>Schuster</surname>
<given-names>SC</given-names>
</name>
 (<year>2011</year>
) <article-title>Integrative analysis of environmental sequences using MEGAN4</article-title>
. <source>Genome Res</source>
<volume>21</volume>
: <fpage>1552</fpage>
–<lpage>1560</lpage>
.<pub-id pub-id-type="pmid">21690186</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Xia1"><label>7</label>
<mixed-citation publication-type="journal"><name><surname>Xia</surname>
<given-names>LC</given-names>
</name>
, <name><surname>Cram</surname>
<given-names>JA</given-names>
</name>
, <name><surname>Chen</surname>
<given-names>T</given-names>
</name>
, <name><surname>Fuhrman</surname>
<given-names>JA</given-names>
</name>
, <name><surname>Sun</surname>
<given-names>F</given-names>
</name>
 (<year>2011</year>
) <article-title>Accurate genome relative abundance estimation based on shotgun metagenomic reads</article-title>
. <source>Plos One</source>
<volume>6</volume>
: <fpage>e27992</fpage>
.<pub-id pub-id-type="pmid">22162995</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Jiang1"><label>8</label>
<mixed-citation publication-type="journal"><name><surname>Jiang</surname>
<given-names>H</given-names>
</name>
, <name><surname>An</surname>
<given-names>L</given-names>
</name>
, <name><surname>Lin</surname>
<given-names>SM</given-names>
</name>
, <name><surname>Feng</surname>
<given-names>G</given-names>
</name>
, <name><surname>Qiu</surname>
<given-names>Y</given-names>
</name>
 (<year>2012</year>
) <article-title>A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads</article-title>
. <source>Plos One</source>
<volume>7</volume>
: <fpage>e46450</fpage>
.<pub-id pub-id-type="pmid">23049702</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Lindner1"><label>9</label>
<mixed-citation publication-type="journal"><name><surname>Lindner</surname>
<given-names>MS</given-names>
</name>
, <name><surname>Renard</surname>
<given-names>BY</given-names>
</name>
 (<year>2013</year>
) <article-title>Metagenomic abundance estimation and diagnostic testing on species level</article-title>
. <source>Nucleic Acids Research</source>
<volume>41</volume>
: <fpage>e10</fpage>
.<pub-id pub-id-type="pmid">22941661</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Mea1"><label>10</label>
<mixed-citation publication-type="journal">Sohn M, <etal>et al</etal>. (2014) Accurate genome relative abundance estimation for closely related species in a metagenomic sample. BMC Bioinformatics <volume>15</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Overbeek1"><label>11</label>
<mixed-citation publication-type="journal"><name><surname>Overbeek</surname>
<given-names>R</given-names>
</name>
, <name><surname>Begley</surname>
<given-names>T</given-names>
</name>
, <name><surname>Butler</surname>
<given-names>RM</given-names>
</name>
, <name><surname>Choudhuri</surname>
<given-names>JV</given-names>
</name>
, <name><surname>Chuang</surname>
<given-names>HY</given-names>
</name>
, <etal>et al</etal>
 (<year>2005</year>
) <article-title>The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes</article-title>
. <source>Nucleic Acids Research</source>
<volume>33</volume>
: <fpage>5691</fpage>
–<lpage>5702</lpage>
.<pub-id pub-id-type="pmid">16214803</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Dinsdale1"><label>12</label>
<mixed-citation publication-type="journal"><name><surname>Dinsdale</surname>
<given-names>EA</given-names>
</name>
, <name><surname>Edwards</surname>
<given-names>RA</given-names>
</name>
, <name><surname>Hall</surname>
<given-names>D</given-names>
</name>
, <name><surname>Angly</surname>
<given-names>F</given-names>
</name>
, <name><surname>Breitbart</surname>
<given-names>M</given-names>
</name>
, <etal>et al</etal>
 (<year>2008</year>
) <article-title>Functional metagenomic profiling of nine biomes</article-title>
. <source>Nature</source>
<volume>452</volume>
: <fpage>629</fpage>
–<lpage>632</lpage>
.<pub-id pub-id-type="pmid">18337718</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Parks1"><label>13</label>
<mixed-citation publication-type="journal"><name><surname>Parks</surname>
<given-names>DH</given-names>
</name>
, <name><surname>Beiko</surname>
<given-names>RG</given-names>
</name>
 (<year>2010</year>
) <article-title>Identifying biologically relevant differences between metagenomic communities</article-title>
. <source>Bioinformatics</source>
<volume>26</volume>
: <fpage>715</fpage>
–<lpage>721</lpage>
.<pub-id pub-id-type="pmid">20130030</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Sharon1"><label>14</label>
<mixed-citation publication-type="journal"><name><surname>Sharon</surname>
<given-names>I</given-names>
</name>
, <name><surname>Bercovici</surname>
<given-names>S</given-names>
</name>
, <name><surname>Pinter</surname>
<given-names>RY</given-names>
</name>
, <name><surname>Shlomi</surname>
<given-names>T</given-names>
</name>
 (<year>2011</year>
) <article-title>Pathway-based functional analysis of metagenomes</article-title>
. <source>J Comput Biol</source>
<volume>18</volume>
: <fpage>495</fpage>
–<lpage>505</lpage>
.<pub-id pub-id-type="pmid">21385050</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Yooseph1"><label>15</label>
<mixed-citation publication-type="journal">Yooseph S, Li WZ, Sutton G (2008) Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinformatics <volume>9</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Prakash1"><label>16</label>
<mixed-citation publication-type="journal"><name><surname>Prakash</surname>
<given-names>T</given-names>
</name>
, <name><surname>Taylor</surname>
<given-names>TD</given-names>
</name>
 (<year>2012</year>
) <article-title>Functional assignment of metagenomic data: challenges and applications</article-title>
. <source>Brief Bioinform</source>
<volume>13</volume>
: <fpage>711</fpage>
–<lpage>727</lpage>
.<pub-id pub-id-type="pmid">22772835</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Mitra1"><label>17</label>
<mixed-citation publication-type="journal">Mitra S, Rupek P, Richter DC, Urich T, Gilbert JA, <etal>et al</etal>
. (2011) Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics <volume>12</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Meyer1"><label>18</label>
<mixed-citation publication-type="journal">Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, <etal>et al</etal>
. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics <volume>9</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Markowitz1"><label>19</label>
<mixed-citation publication-type="other">Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, <etal>et al</etal>
. (2011) IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Research.</mixed-citation>
</ref>
<ref id="pone.0106588-Sun1"><label>20</label>
<mixed-citation publication-type="journal"><name><surname>Sun</surname>
<given-names>SL</given-names>
</name>
, <name><surname>Chen</surname>
<given-names>J</given-names>
</name>
, <name><surname>Li</surname>
<given-names>WZ</given-names>
</name>
, <name><surname>Altintas</surname>
<given-names>I</given-names>
</name>
, <name><surname>Lin</surname>
<given-names>A</given-names>
</name>
, <etal>et al</etal>
 (<year>2011</year>
) <article-title>Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource</article-title>
. <source>Nucleic Acids Research</source>
<volume>39</volume>
: <fpage>D546</fpage>
–<lpage>D551</lpage>
.<pub-id pub-id-type="pmid">21045053</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Hoff1"><label>21</label>
<mixed-citation publication-type="journal"><name><surname>Hoff</surname>
<given-names>KJ</given-names>
</name>
, <name><surname>Lingner</surname>
<given-names>T</given-names>
</name>
, <name><surname>Meinicke</surname>
<given-names>P</given-names>
</name>
, <name><surname>Tech</surname>
<given-names>M</given-names>
</name>
 (<year>2009</year>
) <article-title>Orphelia: predicting genes in metagenomic sequencing reads</article-title>
. <source>Nucleic Acids Research</source>
<volume>37</volume>
: <fpage>W101</fpage>
–<lpage>W105</lpage>
.<pub-id pub-id-type="pmid">19429689</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-APea1"><label>22</label>
<mixed-citation publication-type="journal"><name><surname>Dempster</surname>
<given-names>AP</given-names>
</name>
, <etal>et al</etal>
 (<year>1977</year>
) <article-title>Maximum Likelihood from Incomplete Data via the EM Algorithm</article-title>
. <source>Journal of the Royal Statistical Society Series B (Methodological)</source>
<volume>39</volume>
: <fpage>38</fpage>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Efron1"><label>23</label>
<mixed-citation publication-type="other">Efron B (1982) The jackknife, the bootstrap, and other resampling plans.</mixed-citation>
</ref>
<ref id="pone.0106588-Miller1"><label>24</label>
<mixed-citation publication-type="other">Miller RGJ (1981) Simultaneous Statistical Inference. New York: Springer.</mixed-citation>
</ref>
<ref id="pone.0106588-Richter1"><label>25</label>
<mixed-citation publication-type="journal">Richter DC, Ott F, Auch AF, Schmid R, Huson DH (2008) MetaSim-A Sequencing Simulator for Genomics and Metagenomics. Plos One <volume>3</volume>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Engeman1"><label>26</label>
<mixed-citation publication-type="journal"><name><surname>Engeman</surname>
<given-names>RM</given-names>
</name>
, <name><surname>Sugihara</surname>
<given-names>RT</given-names>
</name>
, <name><surname>Pank</surname>
<given-names>LF</given-names>
</name>
, <name><surname>Dusenberry</surname>
<given-names>WE</given-names>
</name>
 (<year>1994</year>
) <article-title>A Comparison of Plotless Density Estimators Using Monte-Carlo Simulation</article-title>
. <source>Ecology</source>
<volume>75</volume>
: <fpage>1769</fpage>
–<lpage>1779</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0106588-Steffen1"><label>27</label>
<mixed-citation publication-type="journal"><name><surname>Steffen</surname>
<given-names>MM</given-names>
</name>
, <name><surname>Li</surname>
<given-names>Z</given-names>
</name>
, <name><surname>Effler</surname>
<given-names>TC</given-names>
</name>
, <name><surname>Hauser</surname>
<given-names>LJ</given-names>
</name>
, <name><surname>Boyer</surname>
<given-names>GL</given-names>
</name>
, <etal>et al</etal>
 (<year>2012</year>
) <article-title>Comparative metagenomics of toxic freshwater cyanobacteria bloom communities on two continents</article-title>
. <source>Plos One</source>
<volume>7</volume>
: <fpage>e44002</fpage>
.<pub-id pub-id-type="pmid">22952848</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-BeldaFerre1"><label>28</label>
<mixed-citation publication-type="other">Belda-Ferre P, Alcaraz LD, Cabrera-Rubio R, Romero H, Simon-Soro A, <etal>et al</etal>
. (2011) The oral metagenome in health and disease. ISME J.</mixed-citation>
</ref>
<ref id="pone.0106588-Ready1"><label>29</label>
<mixed-citation publication-type="journal"><name><surname>Ready</surname>
<given-names>D</given-names>
</name>
, <name><surname>Pratten</surname>
<given-names>J</given-names>
</name>
, <name><surname>Roberts</surname>
<given-names>AP</given-names>
</name>
, <name><surname>Bedi</surname>
<given-names>R</given-names>
</name>
, <name><surname>Mullany</surname>
<given-names>P</given-names>
</name>
, <etal>et al</etal>
 (<year>2006</year>
) <article-title>Potential role of Veillonella spp. as a reservoir of transferable tetracycline resistance in the oral cavity</article-title>
. <source>Antimicrobial Agents and Chemotherapy</source>
<volume>50</volume>
: <fpage>2866</fpage>
–<lpage>2868</lpage>
.<pub-id pub-id-type="pmid">16870789</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Fozo1"><label>30</label>
<mixed-citation publication-type="journal"><name><surname>Fozo</surname>
<given-names>EM</given-names>
</name>
, <name><surname>Scott-Anne</surname>
<given-names>K</given-names>
</name>
, <name><surname>Koo</surname>
<given-names>H</given-names>
</name>
, <name><surname>Quivey</surname>
<given-names>RG</given-names>
</name>
 (<year>2007</year>
) <article-title>Role of unsaturated fatty acid biosynthesis in virulence of Streptococcus mutans</article-title>
. <source>Infection and Immunity</source>
<volume>75</volume>
: <fpage>1537</fpage>
–<lpage>1539</lpage>
.<pub-id pub-id-type="pmid">17220314</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Seshadri1"><label>31</label>
<mixed-citation publication-type="journal"><name><surname>Seshadri</surname>
<given-names>G</given-names>
</name>
, <name><surname>Myers</surname>
<given-names>GSA</given-names>
</name>
, <name><surname>Tettelin</surname>
<given-names>H</given-names>
</name>
, <name><surname>Eisen</surname>
<given-names>JA</given-names>
</name>
, <name><surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
, <etal>et al</etal>
 (<year>2004</year>
) <article-title>Comparison of the genome of the oral pathogen Treponema denticola with other spirochete genomes</article-title>
. <source>Proc Natl Acad Sci U S A</source>
<volume>101</volume>
: <fpage>5646</fpage>
–<lpage>5651</lpage>
.<pub-id pub-id-type="pmid">15064399</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Park1"><label>32</label>
<mixed-citation publication-type="journal"><name><surname>Park</surname>
<given-names>SN</given-names>
</name>
, <name><surname>Kong</surname>
<given-names>SW</given-names>
</name>
, <name><surname>Kim</surname>
<given-names>HS</given-names>
</name>
, <name><surname>Park</surname>
<given-names>MS</given-names>
</name>
, <name><surname>Lee</surname>
<given-names>JW</given-names>
</name>
, <etal>et al</etal>
 (<year>2012</year>
) <article-title>Draft Genome Sequence of Fusobacterium nucleatum ChDC F128, Isolated from a Periodontitis Lesion</article-title>
. <source>Journal of Bacteriology</source>
<volume>194</volume>
: <fpage>6322</fpage>
–<lpage>6323</lpage>
.<pub-id pub-id-type="pmid">23105064</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Yoshimura1"><label>33</label>
<mixed-citation publication-type="journal"><name><surname>Yoshimura</surname>
<given-names>M</given-names>
</name>
, <name><surname>Nakano</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Yamashita</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Oho</surname>
<given-names>T</given-names>
</name>
, <name><surname>Saito</surname>
<given-names>T</given-names>
</name>
, <etal>et al</etal>
 (<year>2000</year>
) <article-title>Formation of methyl mercaptan from L-methionine by Porphyromonas gingivalis</article-title>
. <source>Infection and Immunity</source>
<volume>68</volume>
: <fpage>6912</fpage>
–<lpage>6916</lpage>
.<pub-id pub-id-type="pmid">11083813</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0106588-Karlsson1"><label>34</label>
<mixed-citation publication-type="journal">Karlsson C, Malmstrom L, Aebersold R, Malmstrom J (2012) Proteome-wide selected reaction monitoring assays for the human pathogen Streptococcus pyogenes. Nature Communications <volume>3</volume>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000086 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000086 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4157783
   |texte=   Statistical Approach of Functional Profiling for a Microbial Community
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25198674" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024
