Internal identifier: 000C80 (Pmc/Corpus); previous: 000C799; next: 000C810

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Identifying <italic>Group-Specific</italic>
 Sequences for Microbial Communities Using Long <italic>k</italic>
-mer Sequence Signatures</title>
<author><name sortKey="Wang, Ying" sort="Wang, Ying" uniqKey="Wang Y" first="Ying" last="Wang">Ying Wang</name>
<affiliation><nlm:aff id="aff1"><institution>Department of Automation, Xiamen University</institution>
,<addr-line>Xiamen</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Fu, Lei" sort="Fu, Lei" uniqKey="Fu L" first="Lei" last="Fu">Lei Fu</name>
<affiliation><nlm:aff id="aff1"><institution>Department of Automation, Xiamen University</institution>
,<addr-line>Xiamen</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Ren, Jie" sort="Ren, Jie" uniqKey="Ren J" first="Jie" last="Ren">Jie Ren</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Yu, Zhaoxia" sort="Yu, Zhaoxia" uniqKey="Yu Z" first="Zhaoxia" last="Yu">Zhaoxia Yu</name>
<affiliation><nlm:aff id="aff3"><institution>Department of Statistics, University of California, Irvine</institution>
,<addr-line>Irvine, CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chen, Ting" sort="Chen, Ting" uniqKey="Chen T" first="Ting" last="Chen">Ting Chen</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff4"><institution>Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff5"><institution>Department of Computer Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Sun, Fengzhu" sort="Sun, Fengzhu" uniqKey="Sun F" first="Fengzhu" last="Sun">Fengzhu Sun</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff6"><institution>Center for Computational Systems Biology, Fudan University</institution>
,<addr-line>Shanghai</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">29774017</idno>
<idno type="pmc">5943621</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5943621</idno>
<idno type="RBID">PMC:5943621</idno>
<idno type="doi">10.3389/fmicb.2018.00872</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000C80</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000C80</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Identifying <italic>Group-Specific</italic>
 Sequences for Microbial Communities Using Long <italic>k</italic>
-mer Sequence Signatures</title>
<author><name sortKey="Wang, Ying" sort="Wang, Ying" uniqKey="Wang Y" first="Ying" last="Wang">Ying Wang</name>
<affiliation><nlm:aff id="aff1"><institution>Department of Automation, Xiamen University</institution>
,<addr-line>Xiamen</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Fu, Lei" sort="Fu, Lei" uniqKey="Fu L" first="Lei" last="Fu">Lei Fu</name>
<affiliation><nlm:aff id="aff1"><institution>Department of Automation, Xiamen University</institution>
,<addr-line>Xiamen</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Ren, Jie" sort="Ren, Jie" uniqKey="Ren J" first="Jie" last="Ren">Jie Ren</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Yu, Zhaoxia" sort="Yu, Zhaoxia" uniqKey="Yu Z" first="Zhaoxia" last="Yu">Zhaoxia Yu</name>
<affiliation><nlm:aff id="aff3"><institution>Department of Statistics, University of California, Irvine</institution>
,<addr-line>Irvine, CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chen, Ting" sort="Chen, Ting" uniqKey="Chen T" first="Ting" last="Chen">Ting Chen</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff4"><institution>Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff5"><institution>Department of Computer Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Sun, Fengzhu" sort="Sun, Fengzhu" uniqKey="Sun F" first="Fengzhu" last="Sun">Fengzhu Sun</name>
<affiliation><nlm:aff id="aff2"><institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff6"><institution>Center for Computational Systems Biology, Fudan University</institution>
,<addr-line>Shanghai</addr-line>
,<country>China</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Frontiers in Microbiology</title>
<idno type="eISSN">1664-302X</idno>
<imprint><date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying <italic>group-specific</italic>
 sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “<italic>group-specific</italic>
” in our study. Our main purpose is to discover <italic>group-specific</italic>
 sequence regions between control and case groups as disease-associated markers. We developed a long <italic>k</italic>
-mer (<italic>k</italic>
 ≥ 30 bps)-based computational pipeline to detect <italic>group-specific</italic>
 sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide <italic>de novo</italic>
 assembly. We called our method MetaGO: <italic>Group-specific</italic>
 oligonucleotide analysis for metagenomic samples. We implemented it as an open-source pipeline on <italic>Apache Spark</italic>
 for parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of the identified <italic>group-specific</italic>
 markers. In the simulated dataset, 99.11% of <italic>group-specific</italic>
 logical <italic>40</italic>
-mers covered 98.89% of the <italic>disease-specific</italic>
 regions from the disease-associated strain. In addition, 97.90% of <italic>group-specific</italic>
 numerical <italic>40</italic>
-mers covered 99.61% and 96.39% of the differentially abundant genome and regions between the two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 <italic>group-specific 40-</italic>
mer features. Each of these features alone can predict the disease status of the training samples with a mean of sensitivity and specificity above 0.8. Random forest classification using the top 10 <italic>group-specific</italic>
 features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All <italic>group-specific 40-</italic>
mers were present in LC patients but not in healthy controls. All 11 assembled <italic>LC-specific</italic>
 sequences can be mapped to two strains of <italic>Veillonella parvula</italic>
: UTDB1-3 and DSM2008. Experiments on the other two real datasets, related to inflammatory bowel disease and type 2 diabetes in women, consistently showed that MetaGO achieved better prediction accuracy with fewer features than previous studies. Overall, the experiments showed that MetaGO is a powerful tool for identifying <italic>group-specific k</italic>
-mers, which would be clinically applicable for disease prediction. MetaGO is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VVsmileyx/MetaGO">https://github.com/VVsmileyx/MetaGO</ext-link>
.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Alneberg, J" uniqKey="Alneberg J">J. Alneberg</name>
</author>
<author><name sortKey="Bjarnason, B S" uniqKey="Bjarnason B">B. S. Bjarnason</name>
</author>
<author><name sortKey="De Bruijn, I" uniqKey="De Bruijn I">I. De Bruijn</name>
</author>
<author><name sortKey="Schirmer, M" uniqKey="Schirmer M">M. Schirmer</name>
</author>
<author><name sortKey="Quick, J" uniqKey="Quick J">J. Quick</name>
</author>
<author><name sortKey="Ijaz, U Z" uniqKey="Ijaz U">U. Z. Ijaz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Altschul, S F" uniqKey="Altschul S">S. F. Altschul</name>
</author>
<author><name sortKey="Madden, T L" uniqKey="Madden T">T. L. Madden</name>
</author>
<author><name sortKey="Schaffer, A A" uniqKey="Schaffer A">A. A. Schäffer</name>
</author>
<author><name sortKey="Zhang, J" uniqKey="Zhang J">J. Zhang</name>
</author>
<author><name sortKey="Zhang, Z" uniqKey="Zhang Z">Z. Zhang</name>
</author>
<author><name sortKey="Miller, W" uniqKey="Miller W">W. Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Benoit, G" uniqKey="Benoit G">G. Benoit</name>
</author>
<author><name sortKey="Peterlongo, P" uniqKey="Peterlongo P">P. Peterlongo</name>
</author>
<author><name sortKey="Mariadassou, M" uniqKey="Mariadassou M">M. Mariadassou</name>
</author>
<author><name sortKey="Drezen, E" uniqKey="Drezen E">E. Drezen</name>
</author>
<author><name sortKey="Schbath, S" uniqKey="Schbath S">S. Schbath</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D. Lavenier</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Breiman, L" uniqKey="Breiman L">L. Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Costello, E K" uniqKey="Costello E">E. K. Costello</name>
</author>
<author><name sortKey="Lauber, C L" uniqKey="Lauber C">C. L. Lauber</name>
</author>
<author><name sortKey="Hamady, M" uniqKey="Hamady M">M. Hamady</name>
</author>
<author><name sortKey="Fierer, N" uniqKey="Fierer N">N. Fierer</name>
</author>
<author><name sortKey="Gordon, J I" uniqKey="Gordon J">J. I. Gordon</name>
</author>
<author><name sortKey="Knight, R" uniqKey="Knight R">R. Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cui, H" uniqKey="Cui H">H. Cui</name>
</author>
<author><name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Feng, Q" uniqKey="Feng Q">Q. Feng</name>
</author>
<author><name sortKey="Liang, S" uniqKey="Liang S">S. Liang</name>
</author>
<author><name sortKey="Jia, H" uniqKey="Jia H">H. Jia</name>
</author>
<author><name sortKey="Stadlmayr, A" uniqKey="Stadlmayr A">A. Stadlmayr</name>
</author>
<author><name sortKey="Tang, L" uniqKey="Tang L">L. Tang</name>
</author>
<author><name sortKey="Lan, Z" uniqKey="Lan Z">Z. Lan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fofanov, Y" uniqKey="Fofanov Y">Y. Fofanov</name>
</author>
<author><name sortKey="Luo, Y" uniqKey="Luo Y">Y. Luo</name>
</author>
<author><name sortKey="Katili, C" uniqKey="Katili C">C. Katili</name>
</author>
<author><name sortKey="Wang, J" uniqKey="Wang J">J. Wang</name>
</author>
<author><name sortKey="Belosludtsev, Y" uniqKey="Belosludtsev Y">Y. Belosludtsev</name>
</author>
<author><name sortKey="Powdrill, T" uniqKey="Powdrill T">T. Powdrill</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Grabherr, M G" uniqKey="Grabherr M">M. G. Grabherr</name>
</author>
<author><name sortKey="Haas, B J" uniqKey="Haas B">B. J. Haas</name>
</author>
<author><name sortKey="Yassour, M" uniqKey="Yassour M">M. Yassour</name>
</author>
<author><name sortKey="Levin, J Z" uniqKey="Levin J">J. Z. Levin</name>
</author>
<author><name sortKey="Thompson, D A" uniqKey="Thompson D">D. A. Thompson</name>
</author>
<author><name sortKey="Amit, I" uniqKey="Amit I">I. Amit</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Han, W" uniqKey="Han W">W. Han</name>
</author>
<author><name sortKey="Wang, M" uniqKey="Wang M">M. Wang</name>
</author>
<author><name sortKey="Ye, Y" uniqKey="Ye Y">Y. Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Huang, X" uniqKey="Huang X">X. Huang</name>
</author>
<author><name sortKey="Madan, A" uniqKey="Madan A">A. Madan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Jiang, B" uniqKey="Jiang B">B. Jiang</name>
</author>
<author><name sortKey="Song, K" uniqKey="Song K">K. Song</name>
</author>
<author><name sortKey="Ren, J" uniqKey="Ren J">J. Ren</name>
</author>
<author><name sortKey="Deng, M" uniqKey="Deng M">M. Deng</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
<author><name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jiang, R" uniqKey="Jiang R">R. Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Karlsson, F H" uniqKey="Karlsson F">F. H. Karlsson</name>
</author>
<author><name sortKey="Tremaroli, V" uniqKey="Tremaroli V">V. Tremaroli</name>
</author>
<author><name sortKey="Nookaew, I" uniqKey="Nookaew I">I. Nookaew</name>
</author>
<author><name sortKey="Bergstrom, G" uniqKey="Bergstrom G">G. Bergström</name>
</author>
<author><name sortKey="Behre, C J" uniqKey="Behre C">C. J. Behre</name>
</author>
<author><name sortKey="Fagerberg, B" uniqKey="Fagerberg B">B. Fagerberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kunin, V" uniqKey="Kunin V">V. Kunin</name>
</author>
<author><name sortKey="Copeland, A" uniqKey="Copeland A">A. Copeland</name>
</author>
<author><name sortKey="Lapidus, A" uniqKey="Lapidus A">A. Lapidus</name>
</author>
<author><name sortKey="Mavromatis, K" uniqKey="Mavromatis K">K. Mavromatis</name>
</author>
<author><name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P. Hugenholtz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Le, V V" uniqKey="Le V">V. V. Le</name>
</author>
<author><name sortKey="Lang, T V" uniqKey="Lang T">T. V. Lang</name>
</author>
<author><name sortKey="Le, T B" uniqKey="Le T">T. B. Le</name>
</author>
<author><name sortKey="Hoai, T V" uniqKey="Hoai T">T. V. Hoai</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, D" uniqKey="Li D">D. Li</name>
</author>
<author><name sortKey="Liu, C M" uniqKey="Liu C">C.-M. Liu</name>
</author>
<author><name sortKey="Luo, R" uniqKey="Luo R">R. Luo</name>
</author>
<author><name sortKey="Sadakane, K" uniqKey="Sadakane K">K. Sadakane</name>
</author>
<author><name sortKey="Lam, T W" uniqKey="Lam T">T.-W. Lam</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
<author><name sortKey="Zhu, H" uniqKey="Zhu H">H. Zhu</name>
</author>
<author><name sortKey="Ruan, J" uniqKey="Ruan J">J. Ruan</name>
</author>
<author><name sortKey="Qian, W" uniqKey="Qian W">W. Qian</name>
</author>
<author><name sortKey="Fang, X" uniqKey="Fang X">X. Fang</name>
</author>
<author><name sortKey="Shi, Z" uniqKey="Shi Z">Z. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Liao, W" uniqKey="Liao W">W. Liao</name>
</author>
<author><name sortKey="Ren, J" uniqKey="Ren J">J. Ren</name>
</author>
<author><name sortKey="Wang, K" uniqKey="Wang K">K. Wang</name>
</author>
<author><name sortKey="Wang, S" uniqKey="Wang S">S. Wang</name>
</author>
<author><name sortKey="Zeng, F" uniqKey="Zeng F">F. Zeng</name>
</author>
<author><name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lozupone, C A" uniqKey="Lozupone C">C. A. Lozupone</name>
</author>
<author><name sortKey="Stombaugh, J" uniqKey="Stombaugh J">J. Stombaugh</name>
</author>
<author><name sortKey="Gonzalez, A" uniqKey="Gonzalez A">A. Gonzalez</name>
</author>
<author><name sortKey="Ackermann, G" uniqKey="Ackermann G">G. Ackermann</name>
</author>
<author><name sortKey="Wendel, D" uniqKey="Wendel D">D. Wendel</name>
</author>
<author><name sortKey="Vazquez Baeza, Y" uniqKey="Vazquez Baeza Y">Y. Vázquez-Baeza</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lu, Y Y" uniqKey="Lu Y">Y. Y. Lu</name>
</author>
<author><name sortKey="Chen, T" uniqKey="Chen T">T. Chen</name>
</author>
<author><name sortKey="Fuhrman, J A" uniqKey="Fuhrman J">J. A. Fuhrman</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
<author><name sortKey="Kingsford, C" uniqKey="Kingsford C">C. Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nielsen, H B" uniqKey="Nielsen H">H. B. Nielsen</name>
</author>
<author><name sortKey="Almeida, M" uniqKey="Almeida M">M. Almeida</name>
</author>
<author><name sortKey="Juncker, A S" uniqKey="Juncker A">A. S. Juncker</name>
</author>
<author><name sortKey="Rasmussen, S" uniqKey="Rasmussen S">S. Rasmussen</name>
</author>
<author><name sortKey="Li, J" uniqKey="Li J">J. Li</name>
</author>
<author><name sortKey="Sunagawa, S" uniqKey="Sunagawa S">S. Sunagawa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Papudeshi, B" uniqKey="Papudeshi B">B. Papudeshi</name>
</author>
<author><name sortKey="Haggerty, J M" uniqKey="Haggerty J">J. M. Haggerty</name>
</author>
<author><name sortKey="Doane, M" uniqKey="Doane M">M. Doane</name>
</author>
<author><name sortKey="Morris, M M" uniqKey="Morris M">M. M. Morris</name>
</author>
<author><name sortKey="Walsh, K" uniqKey="Walsh K">K. Walsh</name>
</author>
<author><name sortKey="Beattie, D T" uniqKey="Beattie D">D. T. Beattie</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pasolli, E" uniqKey="Pasolli E">E. Pasolli</name>
</author>
<author><name sortKey="Schiffer, L" uniqKey="Schiffer L">L. Schiffer</name>
</author>
<author><name sortKey="Manghi, P" uniqKey="Manghi P">P. Manghi</name>
</author>
<author><name sortKey="Renson, A" uniqKey="Renson A">A. Renson</name>
</author>
<author><name sortKey="Obenchain, V" uniqKey="Obenchain V">V. Obenchain</name>
</author>
<author><name sortKey="Truong, D T" uniqKey="Truong D">D. T. Truong</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pasolli, E" uniqKey="Pasolli E">E. Pasolli</name>
</author>
<author><name sortKey="Truong, D T" uniqKey="Truong D">D. T. Truong</name>
</author>
<author><name sortKey="Malik, F" uniqKey="Malik F">F. Malik</name>
</author>
<author><name sortKey="Waldron, L" uniqKey="Waldron L">L. Waldron</name>
</author>
<author><name sortKey="Segata, N" uniqKey="Segata N">N. Segata</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Paulus, W" uniqKey="Paulus W">W. Paulus</name>
</author>
<author><name sortKey="Jellinger, K" uniqKey="Jellinger K">K. Jellinger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Qin, J" uniqKey="Qin J">J. Qin</name>
</author>
<author><name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
<author><name sortKey="Raes, J" uniqKey="Raes J">J. Raes</name>
</author>
<author><name sortKey="Arumugam, M" uniqKey="Arumugam M">M. Arumugam</name>
</author>
<author><name sortKey="Burgdorf, K S" uniqKey="Burgdorf K">K. S. Burgdorf</name>
</author>
<author><name sortKey="Manichanh, C" uniqKey="Manichanh C">C. Manichanh</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Qin, J" uniqKey="Qin J">J. Qin</name>
</author>
<author><name sortKey="Li, Y" uniqKey="Li Y">Y. Li</name>
</author>
<author><name sortKey="Cai, Z" uniqKey="Cai Z">Z. Cai</name>
</author>
<author><name sortKey="Li, S" uniqKey="Li S">S. Li</name>
</author>
<author><name sortKey="Zhu, J" uniqKey="Zhu J">J. Zhu</name>
</author>
<author><name sortKey="Zhang, F" uniqKey="Zhang F">F. Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Qin, N" uniqKey="Qin N">N. Qin</name>
</author>
<author><name sortKey="Yang, F" uniqKey="Yang F">F. Yang</name>
</author>
<author><name sortKey="Li, A" uniqKey="Li A">A. Li</name>
</author>
<author><name sortKey="Prifti, E" uniqKey="Prifti E">E. Prifti</name>
</author>
<author><name sortKey="Chen, Y" uniqKey="Chen Y">Y. Chen</name>
</author>
<author><name sortKey="Shao, L" uniqKey="Shao L">L. Shao</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Quast, C" uniqKey="Quast C">C. Quast</name>
</author>
<author><name sortKey="Pruesse, E" uniqKey="Pruesse E">E. Pruesse</name>
</author>
<author><name sortKey="Yilmaz, P" uniqKey="Yilmaz P">P. Yilmaz</name>
</author>
<author><name sortKey="Gerken, J" uniqKey="Gerken J">J. Gerken</name>
</author>
<author><name sortKey="Schweer, T" uniqKey="Schweer T">T. Schweer</name>
</author>
<author><name sortKey="Yarza, P" uniqKey="Yarza P">P. Yarza</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ren, J" uniqKey="Ren J">J. Ren</name>
</author>
<author><name sortKey="Ahlgren, N A" uniqKey="Ahlgren N">N. A. Ahlgren</name>
</author>
<author><name sortKey="Lu, Y Y" uniqKey="Lu Y">Y. Y. Lu</name>
</author>
<author><name sortKey="Fuhrman, J A" uniqKey="Fuhrman J">J. A. Fuhrman</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Richter, D C" uniqKey="Richter D">D. C. Richter</name>
</author>
<author><name sortKey="Ott, F" uniqKey="Ott F">F. Ott</name>
</author>
<author><name sortKey="Auch, A F" uniqKey="Auch A">A. F. Auch</name>
</author>
<author><name sortKey="Schmid, R" uniqKey="Schmid R">R. Schmid</name>
</author>
<author><name sortKey="Huson, D H" uniqKey="Huson D">D. H. Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rizk, G" uniqKey="Rizk G">G. Rizk</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D. Lavenier</name>
</author>
<author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sangwan, N" uniqKey="Sangwan N">N. Sangwan</name>
</author>
<author><name sortKey="Xia, F" uniqKey="Xia F">F. Xia</name>
</author>
<author><name sortKey="Gilbert, J A" uniqKey="Gilbert J">J. A. Gilbert</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A. Sczyrba</name>
</author>
<author><name sortKey="Hofmann, P" uniqKey="Hofmann P">P. Hofmann</name>
</author>
<author><name sortKey="Belmann, P" uniqKey="Belmann P">P. Belmann</name>
</author>
<author><name sortKey="Koslicki, D" uniqKey="Koslicki D">D. Koslicki</name>
</author>
<author><name sortKey="Janssen, S" uniqKey="Janssen S">S. Janssen</name>
</author>
<author><name sortKey="Droge, J" uniqKey="Droge J">J. Dröge</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Segata, N" uniqKey="Segata N">N. Segata</name>
</author>
<author><name sortKey="Izard, J" uniqKey="Izard J">J. Izard</name>
</author>
<author><name sortKey="Waldron, L" uniqKey="Waldron L">L. Waldron</name>
</author>
<author><name sortKey="Gevers, D" uniqKey="Gevers D">D. Gevers</name>
</author>
<author><name sortKey="Miropolsky, L" uniqKey="Miropolsky L">L. Miropolsky</name>
</author>
<author><name sortKey="Garrett, W S" uniqKey="Garrett W">W. S. Garrett</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
<author><name sortKey="Lei, X" uniqKey="Lei X">X. Lei</name>
</author>
<author><name sortKey="Wang, S" uniqKey="Wang S">S. Wang</name>
</author>
<author><name sortKey="Wang, Z" uniqKey="Wang Z">Z. Wang</name>
</author>
<author><name sortKey="Song, N" uniqKey="Song N">N. Song</name>
</author>
<author><name sortKey="Zeng, F" uniqKey="Zeng F">F. Zeng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
<author><name sortKey="Liu, L" uniqKey="Liu L">L. Liu</name>
</author>
<author><name sortKey="Chen, L" uniqKey="Chen L">L. Chen</name>
</author>
<author><name sortKey="Chen, T" uniqKey="Chen T">T. Chen</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
<author><name sortKey="Wang, K" uniqKey="Wang K">K. Wang</name>
</author>
<author><name sortKey="Lu, Y Y" uniqKey="Lu Y">Y. Y. Lu</name>
</author>
<author><name sortKey="Sun, F" uniqKey="Sun F">F. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wen, C" uniqKey="Wen C">C. Wen</name>
</author>
<author><name sortKey="Zheng, Z" uniqKey="Zheng Z">Z. Zheng</name>
</author>
<author><name sortKey="Shao, T" uniqKey="Shao T">T. Shao</name>
</author>
<author><name sortKey="Lin, L" uniqKey="Lin L">L. Lin</name>
</author>
<author><name sortKey="Xie, Z" uniqKey="Xie Z">Z. Xie</name>
</author>
<author><name sortKey="Chatelier, E L" uniqKey="Chatelier E">E. L. Chatelier</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="White, J R" uniqKey="White J">J. R. White</name>
</author>
<author><name sortKey="Nagarajan, N" uniqKey="Nagarajan N">N. Nagarajan</name>
</author>
<author><name sortKey="Pop, M" uniqKey="Pop M">M. Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wiest, R" uniqKey="Wiest R">R. Wiest</name>
</author>
<author><name sortKey="Lawson, M" uniqKey="Lawson M">M. Lawson</name>
</author>
<author><name sortKey="Geuking, M" uniqKey="Geuking M">M. Geuking</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, Y W" uniqKey="Wu Y">Y.-W. Wu</name>
</author>
<author><name sortKey="Simmons, B A" uniqKey="Simmons B">B. A. Simmons</name>
</author>
<author><name sortKey="Singer, S W" uniqKey="Singer S">S. W. Singer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Xing, X" uniqKey="Xing X">X. Xing</name>
</author>
<author><name sortKey="Liu, J S" uniqKey="Liu J">J. S. Liu</name>
</author>
<author><name sortKey="Zhong, W" uniqKey="Zhong W">W. Zhong</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yatsunenko, T" uniqKey="Yatsunenko T">T. Yatsunenko</name>
</author>
<author><name sortKey="Rey, F E" uniqKey="Rey F">F. E. Rey</name>
</author>
<author><name sortKey="Manary, M J" uniqKey="Manary M">M. J. Manary</name>
</author>
<author><name sortKey="Trehan, I" uniqKey="Trehan I">I. Trehan</name>
</author>
<author><name sortKey="Dominguez Bello, M G" uniqKey="Dominguez Bello M">M. G. Dominguez-Bello</name>
</author>
<author><name sortKey="Contreras, M" uniqKey="Contreras M">M. Contreras</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zaharia, M" uniqKey="Zaharia M">M. Zaharia</name>
</author>
<author><name sortKey="Chowdhury, M" uniqKey="Chowdhury M">M. Chowdhury</name>
</author>
<author><name sortKey="Franklin, M J" uniqKey="Franklin M">M. J. Franklin</name>
</author>
<author><name sortKey="Shenker, S" uniqKey="Shenker S">S. Shenker</name>
</author>
<author><name sortKey="Stoica, I" uniqKey="Stoica I">I. Stoica</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author><name sortKey="Lu, X" uniqKey="Lu X">X. Lu</name>
</author>
<author><name sortKey="Shi, Q" uniqKey="Shi Q">Q. Shi</name>
</author>
<author><name sortKey="Xu, X Q" uniqKey="Xu X">X. Q. Xu</name>
</author>
<author><name sortKey="Leung, H C" uniqKey="Leung H">H. C. Leung</name>
</author>
<author><name sortKey="Harris, L N" uniqKey="Harris L">L. N. Harris</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Front Microbiol</journal-id>
<journal-id journal-id-type="iso-abbrev">Front Microbiol</journal-id>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title-group><journal-title>Frontiers in Microbiology</journal-title>
</journal-title-group>
<issn pub-type="epub">1664-302X</issn>
<publisher><publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">29774017</article-id>
<article-id pub-id-type="pmc">5943621</article-id>
<article-id pub-id-type="doi">10.3389/fmicb.2018.00872</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Microbiology</subject>
<subj-group><subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>Identifying <italic>Group-Specific</italic>
 Sequences for Microbial Communities Using Long <italic>k</italic>
-mer Sequence Signatures</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Wang</surname>
<given-names>Ying</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001"><sup>*</sup>
</xref>
<uri xlink:type="simple" xlink:href="http://loop.frontiersin.org/people/452253/overview"></uri>
</contrib>
<contrib contrib-type="author"><name><surname>Fu</surname>
<given-names>Lei</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1</sup>
</xref>
<uri xlink:type="simple" xlink:href="http://loop.frontiersin.org/people/554628/overview"></uri>
</contrib>
<contrib contrib-type="author"><name><surname>Ren</surname>
<given-names>Jie</given-names>
</name>
<xref ref-type="aff" rid="aff2"><sup>2</sup>
</xref>
<uri xlink:type="simple" xlink:href="http://loop.frontiersin.org/people/547468/overview"></uri>
</contrib>
<contrib contrib-type="author"><name><surname>Yu</surname>
<given-names>Zhaoxia</given-names>
</name>
<xref ref-type="aff" rid="aff3"><sup>3</sup>
</xref>
<uri xlink:type="simple" xlink:href="http://loop.frontiersin.org/people/34164/overview"></uri>
</contrib>
<contrib contrib-type="author"><name><surname>Chen</surname>
<given-names>Ting</given-names>
</name>
<xref ref-type="aff" rid="aff2"><sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup>
</xref>
<xref ref-type="aff" rid="aff5"><sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Sun</surname>
<given-names>Fengzhu</given-names>
</name>
<xref ref-type="aff" rid="aff2"><sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff6"><sup>6</sup>
</xref>
<xref ref-type="corresp" rid="c001"><sup>*</sup>
</xref>
<uri xlink:type="simple" xlink:href="http://loop.frontiersin.org/people/47840/overview"></uri>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup>
<institution>Department of Automation, Xiamen University</institution>
,<addr-line>Xiamen</addr-line>
,<country>China</country>
</aff>
<aff id="aff2"><sup>2</sup>
<institution>Molecular and Computational Biology Program, University of Southern California, Los Angeles</institution>
,<addr-line>CA</addr-line>
,<country>United States</country>
</aff>
<aff id="aff3"><sup>3</sup>
<institution>Department of Statistics, University of California, Irvine</institution>
,<addr-line>Irvine, CA</addr-line>
,<country>United States</country>
</aff>
<aff id="aff4"><sup>4</sup>
<institution>Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</aff>
<aff id="aff5"><sup>5</sup>
<institution>Department of Computer Science and Technology, Tsinghua University</institution>
,<addr-line>Beijing</addr-line>
,<country>China</country>
</aff>
<aff id="aff6"><sup>6</sup>
<institution>Center for Computational Systems Biology, Fudan University</institution>
,<addr-line>Shanghai</addr-line>
,<country>China</country>
</aff>
<author-notes><fn fn-type="edited-by"><p>Edited by: Jessica Galloway-Pena, The University of Texas MD Anderson Cancer Center, United States</p>
</fn>
<fn fn-type="edited-by"><p>Reviewed by: Wenxuan Zhong, University of Georgia, United States; Jonathan Badger, National Cancer Institute (NCI), United States</p>
</fn>
<corresp id="c001">*Correspondence: Ying Wang, <email>wangying@xmu.edu.cn</email>
; Fengzhu Sun, <email>fsun@usc.edu</email>
; <email>fsun@dornsife.usc.edu</email>
</corresp>
<fn fn-type="other" id="fn002"><p>This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>03</day>
<month>5</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection"><year>2018</year>
</pub-date>
<volume>9</volume>
<elocation-id>872</elocation-id>
<history><date date-type="received"><day>15</day>
<month>11</month>
<year>2017</year>
</date>
<date date-type="accepted"><day>16</day>
<month>4</month>
<year>2018</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2018 Wang, Fu, Ren, Yu, Chen and Sun.</copyright-statement>
<copyright-year>2018</copyright-year>
<copyright-holder>Wang, Fu, Ren, Yu, Chen and Sun</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract><p>Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying <italic>group-specific</italic>
 sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “<italic>group-specific</italic>
” in our study. Our main purpose is to discover <italic>group-specific</italic>
 sequence regions between control and case groups as disease-associated markers. We developed a long <italic>k</italic>
-mer (<italic>k</italic>
 ≥ 30 bps)-based computational pipeline to detect <italic>group-specific</italic>
 sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide <italic>de novo</italic>
 assembly. We called our method MetaGO: <italic>Group-specific</italic>
 oligonucleotide analysis for metagenomic samples. An open-source pipeline on <italic>Apache Spark</italic>
 was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified <italic>group-specific</italic>
 markers. In the simulated dataset, 99.11% of <italic>group-specific</italic>
 logical <italic>40</italic>
-mers covered 98.89% of the <italic>disease-specific</italic>
 regions from the disease-associated strain. In addition, 97.90% of <italic>group-specific</italic>
 numerical <italic>40</italic>
-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 <italic>group-specific 40-</italic>
mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 <italic>group-specific</italic>
 features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All <italic>group-specific 40-</italic>
mers were present in LC patients, but not in healthy controls. All 11 assembled <italic>LC-specific</italic>
 sequences can be mapped to two strains of <italic>Veillonella parvula</italic>
: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying <italic>group-specific k</italic>
-mers, which would be clinically applicable for disease prediction. MetaGO is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VVsmileyx/MetaGO">https://github.com/VVsmileyx/MetaGO</ext-link>
.</p>
</abstract>
<kwd-group><kwd>long <italic>k</italic>
-mer</kwd>
<kwd>classification</kwd>
<kwd><italic>group-specific</italic>
 sequence</kwd>
<kwd>metagenomics</kwd>
<kwd>microbial community</kwd>
<kwd>disease prediction</kwd>
</kwd-group>
<funding-group><award-group><funding-source id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content>
</funding-source>
<award-id rid="cn001">61673324</award-id>
<award-id rid="cn001">61561146396</award-id>
</award-group>
<award-group><funding-source id="cn002">National Science Foundation<named-content content-type="fundref-id">10.13039/100000001</named-content>
</funding-source>
<award-id rid="cn002">DMS-1518001</award-id>
</award-group>
<award-group><funding-source id="cn003">Foundation for the National Institutes of Health<named-content content-type="fundref-id">10.13039/100000009</named-content>
</funding-source>
<award-id rid="cn003">R01GM120624</award-id>
</award-group>
<award-group><funding-source id="cn004">Natural Science Foundation of Fujian Province<named-content content-type="fundref-id">10.13039/501100003392</named-content>
</funding-source>
<award-id rid="cn004">2016J01316</award-id>
</award-group>
<award-group><funding-source id="cn005">China Scholarship Council<named-content content-type="fundref-id">10.13039/501100004543</named-content>
</funding-source>
<award-id rid="cn005">201606315011</award-id>
</award-group>
</funding-group>
<counts><fig-count count="6"></fig-count>
<table-count count="4"></table-count>
<equation-count count="0"></equation-count>
<ref-count count="49"></ref-count>
<page-count count="18"></page-count>
<word-count count="0"></word-count>
</counts>
</article-meta>
</front>
<body><sec><title>Introduction</title>
<p>High-throughput sequencing technologies have ushered in new views of ubiquity and diversity of microbial communities (<xref rid="B47" ref-type="bibr">Yatsunenko et al., 2012</xref>
). Metagenomic sequencing data permit comprehensive profiling of microbial communities at single-nucleotide resolution. The ability to compare two groups of metagenomic samples is crucial for understanding microbial communities and their effects on hosts. Typically, for two groups of individuals, patients with a certain disease and healthy individuals, <italic>group-specific</italic>
 markers offer significant support in understanding and predicting disease. Here, “<italic>group-specific</italic>
 markers” can be genes, species, or sequences present, or rich, in one group, but absent, or scarce, in another group. “<italic>Group-specific</italic>
” focuses on the highest discriminative power, rather than the statistically significant difference (<xref rid="B43" ref-type="bibr">White et al., 2009</xref>
; <xref rid="B38" ref-type="bibr">Segata et al., 2011</xref>
), to classify, or predict, case and control groups. Accordingly, prediction performance evaluates the discriminative capability of identified <italic>group-specific</italic>
 features.</p>
<p>Some studies characterized microbiomes by aligning reads to reference genomes or 16S rRNA marker genes (<xref rid="B5" ref-type="bibr">Costello et al., 2009</xref>
; <xref rid="B32" ref-type="bibr">Quast et al., 2012</xref>
; <xref rid="B21" ref-type="bibr">Lozupone et al., 2013</xref>
; <xref rid="B14" ref-type="bibr">Jiang, 2015</xref>
). It was realized that the alignment-based methods were limited by incomplete or inaccurate reference sequences (<xref rid="B16" ref-type="bibr">Kunin et al., 2008</xref>
). For example, only about 31.0–48.8% of the shotgun reads from human gut could be aligned to 194 public human gut bacterial genomes, and 7.6–21.2% to the bacterial genomes deposited in GenBank (<xref rid="B29" ref-type="bibr">Qin et al., 2010</xref>
). Recently, more studies adopted reference-free strategies to analyze the compositional differences of metagenomes between control and case groups at the microbial gene, gene set, or species levels. Generally, contigs were produced through the metagenome-wide <italic>de novo</italic>
 assembly, and a gene catalog was established through open-reading frame (ORF) prediction. The above processing was first applied to human microbiome of inflammatory bowel disease (IBD) (<xref rid="B29" ref-type="bibr">Qin et al., 2010</xref>
). Follow-up investigations were conducted based on the constructed gene sets: approximately 60,000 associated gene markers were identified to predict Type 2 Diabetes (T2D), and the concept of a metagenomic linkage group was proposed, which is a group of genes that co-exist among samples and has a consistent abundance level and taxonomic assignments (<xref rid="B30" ref-type="bibr">Qin et al., 2012</xref>
). The metagenomic gene clusters based on high abundance correlations were further applied to predict T2D in European women using gut metagenomic samples (<xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
). Gene clusters containing a large number of genes (such as >700) assisted <italic>de novo</italic>
 genome assembly to discover microbial species associated with liver cirrhosis (LC) (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
) and IBD (<xref rid="B24" ref-type="bibr">Nielsen et al., 2014</xref>
). <xref rid="B27" ref-type="bibr">Pasolli et al. (2016</xref>
, <xref rid="B26" ref-type="bibr">2017</xref>
) conducted prediction tasks on 2424 metagenomic samples from eight large-scale projects using species-level relative abundances and the presence of strain-specific markers as features. <xref rid="B42" ref-type="bibr">Wen et al. (2017)</xref>
 compared the prediction performance of three types of biomarkers (sequenced reference genomes, genes, and gene clusters) for ankylosing spondylitis based on gut metagenomic samples. They found that gene markers performed better than reference genome markers and clustered gene markers, and that the clustered gene markers might be limited by taxonomically unknown organisms in the clusters. Almost all of the above studies followed the analysis pipeline of <italic>de novo</italic>
 contig assembly, gene prediction, and gene clustering. Previous studies concluded that metagenome assembly performs well for microbial communities that have high coverage of phylogenetically distinct genomes and low taxonomic diversity (<xref rid="B25" ref-type="bibr">Papudeshi et al., 2017</xref>
), but the presence of closely related strains in one community substantially degrades assembly performance (<xref rid="B36" ref-type="bibr">Sangwan et al., 2016</xref>
; <xref rid="B37" ref-type="bibr">Sczyrba et al., 2017</xref>
). Moreover, high co-abundance among species would result in multiple species in one cluster (<xref rid="B24" ref-type="bibr">Nielsen et al., 2014</xref>
). Therefore, components with closely related genome sequences or abundance would diminish the performance of assembly and clustering in microbial community studies.</p>
<p>Besides genes or species, assembled contigs have also been used as features to predict disease. Several contig binning tools, such as CONCOCT (<xref rid="B1" ref-type="bibr">Alneberg et al., 2014</xref>
), MaxBin2.0 (<xref rid="B45" ref-type="bibr">Wu et al., 2016</xref>
), COCACOLA (<xref rid="B22" ref-type="bibr">Lu et al., 2017</xref>
), and MetaGen (<xref rid="B46" ref-type="bibr">Xing et al., 2017</xref>
), were developed for binning contigs assuming that contigs with similar coverage/relative abundances over different samples come from the same genomes. In particular, although the main purpose of MetaGen (<xref rid="B46" ref-type="bibr">Xing et al., 2017</xref>
) is to identify microbial species in the community through binning, the study not only designed comprehensive experiments to analyze the effect of sequencing depth, sample size, number of species and sequence similarity, but also used the relative abundance of each bin to predict IBD/T2D/obesity disease on metagenomic datasets to evaluate the binned microbial composition. Similarly, <xref rid="B33" ref-type="bibr">Ren et al. (2017)</xref>
 developed a novel pipeline to predict the disease status of LC using the abundance of viral contig bins. Both studies made novel attempts to identify markers through assembling <italic>de novo</italic>
 reads into contigs and then binning contigs, which achieved excellent prediction results. The basic idea is to discover species markers that are differentially abundant between case and control groups. However, current assembly tools struggle with large-scale datasets: read assembly involves the construction of a <italic>De Bruijn</italic>
 graph, error correction, and path resolution, and contig binning requires mapping reads to the assembled contigs; both steps require extremely large memory and are very time-consuming. Also, if the main purpose is to discover <italic>group-specific</italic>
 markers, it is not necessary to assemble contigs for the genomes that are not associated with the disease.</p>
<p>The <italic>k</italic>
-mer frequencies (i.e., the number of occurrences of <italic>k</italic>
-mers within the whole sequencing data) are another representative alignment-free feature to characterize a microbial community. The frequency distributions of <italic>2–10</italic>
-mers were used to compare metagenomic and meta-transcriptomic communities (<xref rid="B13" ref-type="bibr">Jiang et al., 2012</xref>
; <xref rid="B40" ref-type="bibr">Wang et al., 2014</xref>
; <xref rid="B20" ref-type="bibr">Liao et al., 2016</xref>
) or to improve contig binning within a community (<xref rid="B41" ref-type="bibr">Wang et al., 2017</xref>
). Also, <xref rid="B6" ref-type="bibr">Cui and Zhang (2013)</xref>
 classified clinical metagenomic samples using the frequencies of <italic>2–10</italic>
-mers.</p>
<p>However, <italic>2–10</italic>
-mers are too short to capture specific details inside the microbial community, such as sequences present, or rich, in one group, but absent, or scarce, in another group. Intuitively, longer <italic>k</italic>
-mers contain richer biological information in the nucleotide sequences. Long <italic>k</italic>
-mers have mainly been used as seed indexes in sequence assembly and alignment (<xref rid="B19" ref-type="bibr">Li et al., 2010</xref>
; <xref rid="B9" ref-type="bibr">Grabherr et al., 2011</xref>
). Recently, long <italic>k</italic>
-mers (≥20 bp) have found wider application: our previous study explored the effect of <italic>k</italic>
-mer length on an unsupervised comparison between metagenomic samples and verified the promising performance of long <italic>k</italic>
-mers to depict the specific characteristics of microbial communities (<xref rid="B39" ref-type="bibr">Wang et al., 2015</xref>
). <xref rid="B10" ref-type="bibr">Han et al. (2017)</xref>
 detected differentially abundant <italic>21</italic>
-mers in metagenomic samples from T2D and healthy individuals, assembled the reads containing those <italic>21</italic>
-mers into contigs, and then predicted genes based on the contigs. Finally, they used the gene abundances to predict T2D status. Our study differs from <xref rid="B10" ref-type="bibr">Han et al. (2017)</xref>
 in the sense that we do not predict genes based on the contigs assembled from reads containing statistically differentially abundant <italic>k</italic>
-mers. Instead, we identified <italic>group-specific k</italic>
-mers using discriminative power to separate two groups and predicted disease status with <italic>k</italic>
-mers as features. Moreover, <italic>group-specific k</italic>
-mers were assembled into contigs directly. Rahman et al. (unpublished) found significantly differentially abundant <italic>31</italic>
-mers between two groups of 1000 Genomes data and discovered SNPs between different populations, an objective quite different from that of this study. The frequency vector of long <italic>k</italic>
-mers (∼30 bp) was also applied to calculate the dissimilarity between metagenomic samples using 16 standard ecological distances (<xref rid="B3" ref-type="bibr">Benoit et al., 2016</xref>
). The long <italic>k</italic>
-mers have begun to show attractive potential for characterizing high-throughput sequencing data.</p>
<p>Since sufficiently long <italic>k</italic>
-mers are usually specific to a genome (<xref rid="B8" ref-type="bibr">Fofanov et al., 2004</xref>
), we proposed a computational framework to identify <italic>group-specific</italic>
 sequences between two groups of metagenomic samples with long (≥30 bp) <italic>k-</italic>
mers in this study. We call our method MetaGO: <italic>Group-specific</italic>
 oligonucleotide analysis for metagenomic samples. The main purpose of MetaGO is to discover <italic>group-specific</italic>
 sequence regions between control and case groups as disease-associated markers. Instead of using a statistically significant difference as the index, we considered the discriminant power of a single <italic>k</italic>
-mer to separate the two groups. A <italic>k</italic>
-mer is considered <italic>group-specific</italic>
 if (1) the average of sensitivity and specificity (ASS) is higher than a preset threshold when using the presence/absence of the <italic>k</italic>
-mer on the sequencing data to predict disease status, or (2) the <italic>k</italic>
-mer’s frequencies are significantly different between two groups of samples (Wilcoxon rank-sum test, <italic>p</italic>
-value ≤ 0.01) and the ASS is higher than a preset threshold using logistic regression. The <italic>group-specific k</italic>
-mers are identified based on the training set. In our study, <italic>k</italic>
-mer length is set between 30 and 40 given the tradeoff among sensitivity, specificity, and computational cost. To reduce the computational burden from long <italic>k</italic>
-mers, we developed an open-source, parallel-computing pipeline on <italic>Apache Spark</italic>
. Once the <italic>group-specific k-</italic>
mers are identified, we assemble them into <italic>group-specific</italic>
 sequences. Assembling this markedly reduced number of long <italic>k</italic>
-mers is more computationally efficient and accurate than metagenome-wide assembly.</p>
<p>MetaGO was tested on one simulated and three real metagenomic datasets. In the simulated dataset, for the two strains sharing 87% common sequences where one is disease specific and the other one is present in both groups, we identified <italic>group-specific</italic>
 logical <italic>40</italic>
-mers that covered 98.89% (recall) of the <italic>disease-specific</italic>
 sequence regions from the disease-associated strain with 98.91% precision. In addition, 98.83% of the <italic>group-specific</italic>
 numerical <italic>40</italic>
-mers covered 99.01 and 97.30% of the differentially abundant genome and regions, respectively. The metagenomic LC-associated dataset (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
) is composed of human fecal samples from 98 LC patients and 83 healthy controls, as well as an additional independent dataset containing 25 patients and 31 controls. The <italic>k</italic>
-mer length was set to 40 because of the large sample size (number of samples). In our experiment, two-thirds of the 98 patient and 83 control samples were randomly selected as the training set, leaving one-third as the validation set and the extra 25 patients and 31 controls as the independent testing set. In total, 37,647 <italic>group-specific 40</italic>
-mers were identified on the training set, and 35,652 and 12,944 of the <italic>group-specific 40</italic>
-mers yielded ASS ≥ 0.8 on the validation and testing sets, respectively. The <italic>single-logical-feature</italic>
 predictor with the highest ASS score (0.87) on the training set predicted the disease status in the validation and testing sets with ASS scores of 0.88 and 0.83, respectively. Using the top 10 <italic>group-specific 40</italic>
-mers, the random forests classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.963, 0.969, and 0.942 on the training, validation, and testing sets, respectively. It is interesting to note that all 37,647 <italic>40-</italic>
mers were present in LC patients but absent from healthy controls. The <italic>LC-specific 40</italic>
-mers were assembled into 11 sequences with lengths between 210 and 350 bp, whose coverage clearly distinguished the two groups. All the identified <italic>LC-specific</italic>
 sequences could be matched to two strains of <italic>Veillonella parvula</italic>
, UTDB1-3 and DSM2008, with 97–100% identity. In addition, 83.2 and 79.6% of the 37,647 <italic>group-specific 40</italic>
-mers could be matched to strains UTDB1-3 and DSM2008, respectively.</p>
<p>We also identified <italic>group-specific k</italic>
-mers based on two more metagenomic disease-associated datasets: IBD associated (<xref rid="B29" ref-type="bibr">Qin et al., 2010</xref>
) and WT2D (T2D in women) associated (<xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
). Based on the identified <italic>group-specific k</italic>
-mers, our pipeline achieved substantially better prediction performance using relatively fewer features compared with previous studies having identical or relaxed experimental settings. All experiments demonstrated long <italic>k</italic>
-mers to be more efficient in capturing the specific information of sequencing data and discriminating gut microbiome communities between control and case groups. It should be noted that <italic>group-specific</italic>
 sequences are identified free from reference sequences, metagenome-wide assembly, and sequence alignments. MetaGO greatly facilitates the identification of clinically meaningful biomarkers.</p>
</sec>
<sec sec-type="materials|methods" id="s1"><title>Materials and Methods</title>
<sec><title>Description of Terms</title>
<p><italic>A group-specific feature</italic>
 is a <italic>k</italic>
-mer present, or rich, in the metagenomic sequencing data of one group, but absent, or sparse, in the sequencing data of another group. A <italic>k-</italic>
mer is a word composed of <italic>k</italic>
 nucleotides, and the total number of all possible <italic>k</italic>
-mers is 4<sup>k</sup>
.</p>
<p>We defined <italic>k</italic>
-mer features in the following two ways:</p>
<p><italic>Numerical features</italic>
 are the normalized frequencies of <italic>k</italic>
-mers. The numerical feature of a <italic>k</italic>
-mer <italic>i</italic>
 in sample <italic>j</italic>
 is denoted as <italic>f<sub>i</sub>
</italic>
(<italic>j</italic>
) and is defined in Equation (1), where f<sub>i</sub>
°(j) is the number of occurrences of <italic>k</italic>
-mer <italic>i</italic>
 in sample <italic>j</italic>
, and <italic>n</italic>
 is the total number of possible <italic>k</italic>
-mers, that is 4<italic><sup>k</sup>
</italic>
. The normalized frequency is therefore the number of occurrences of the <italic>k</italic>
-mer divided by the total number of occurrences of all <italic>k</italic>
-mers in one sample. Each <italic>k</italic>
-mer has the same length <italic>k</italic>
, so length is not considered during the normalization.</p>
<disp-formula id="E1"><label>(1)</label>
<mml:math id="M1"><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msubsup><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>°</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:msubsup><mml:mo>∑</mml:mo>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msubsup><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>°</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>...</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
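<p>As an illustration of Equation (1), the following minimal Python sketch counts k-mers over the reads of one sample and normalizes the counts. The function names, the toy reads, and the use of k = 3 are assumptions made for brevity; the sketch ignores reverse complements and ambiguous bases and is not the MetaGO implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer over all reads of one sample."""
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and adds nothing.
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def normalize(counts):
    """Equation (1): divide each count by the sum of counts over all k-mers.

    k-mers that never occur contribute 0 to the sum, so summing over the
    observed k-mers equals summing over all 4**k possible k-mers.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


# Hypothetical toy sample; the study itself uses k between 30 and 40.
toy_reads = ["ACGTAC", "CGTACG"]
toy_counts = count_kmers(toy_reads, 3)
toy_freqs = normalize(toy_counts)
```
]]></preformat>
<p>Because only observed k-mers are stored, the memory needed grows with the number of distinct k-mers in the sample rather than with 4<sup>k</sup>.</p>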
<p><italic>Logical features</italic>
 are the logicalization of numerical features. They use 1 and 0 to represent <italic>k</italic>
-mers as present or absent in one sample, as shown in Equation (2),</p>
<disp-formula id="E2"><label>(2)</label>
<mml:math id="M2"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow><mml:mo stretchy="false">(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>{</mml:mo>
<mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mn>1</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>if</mml:mtext>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:msub><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>></mml:mo>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mn>0</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>if</mml:mtext>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:msub><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where f<sub>i</sub>
<sup>(l)</sup>
(j) is the logical value of <italic>k</italic>
-mer <italic>i</italic>
 in sample <italic>j</italic>
, and the superscript “<italic>l</italic>
” indicates logical feature.</p>
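<p>Equation (2) can be sketched in a few lines of Python. The function name, the dictionary representation of a sample, and the example frequencies are hypothetical and serve only to illustrate the presence/absence encoding.</p>
<preformat preformat-type="code"><![CDATA[
```python
def to_logical(numerical, kmers):
    """Equation (2): 1 if the k-mer's normalized frequency is > 0, else 0.

    numerical maps each observed k-mer to its normalized frequency;
    k-mers missing from the mapping are treated as frequency 0.
    """
    return {kmer: 1 if numerical.get(kmer, 0.0) > 0 else 0 for kmer in kmers}


# Hypothetical normalized frequencies for two samples (k = 3 for brevity).
sample_a = {"ACG": 0.5, "CGT": 0.5}
sample_b = {"CGT": 1.0}
feature_order = ["ACG", "CGT", "GTA"]

logical_a = to_logical(sample_a, feature_order)
logical_b = to_logical(sample_b, feature_order)
```
]]></preformat>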
<p><italic>A single-logical-feature predictor</italic>
, as represented in Equations (3) and (4), is used to predict disease status based on whether a <italic>k</italic>
-mer <italic>i</italic>
 is present in the sequencing data of sample <italic>j</italic>
 or not.</p>
<disp-formula id="E3"><label>(3)</label>
<mml:math id="M3"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow><mml:mo stretchy="false">(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>{</mml:mo>
<mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mn>1</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>then sample </mml:mtext>
<mml:mi>j</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mtext>Group</mml:mtext>
<mml:mo>+</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mn>0</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>then sample </mml:mtext>
<mml:mi>j</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mtext>Group</mml:mtext>
<mml:mo>−</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>or</p>
<disp-formula id="E4"><label>(4)</label>
<mml:math id="M4"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow><mml:mo stretchy="false">(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>{</mml:mo>
<mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mn>1</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>then sample </mml:mtext>
<mml:mi>j</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mtext>Group</mml:mtext>
<mml:mo>−</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mn>0</mml:mn>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mtext>then sample </mml:mtext>
<mml:mi>j</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mtext>Group</mml:mtext>
<mml:mo>+</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
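<p>The two prediction rules in Equations (3) and (4) can be scored with the average of sensitivity and specificity (ASS). The sketch below is a minimal illustration under stated assumptions: labels are coded 1 for Group+ and 0 for Group−, the function names and toy data are hypothetical, and keeping whichever rule scores higher is our reading of the "or" between the two equations rather than a documented MetaGO step.</p>
<preformat preformat-type="code"><![CDATA[
```python
def ass(predicted, actual):
    """Average of sensitivity and specificity; 1 = Group+, 0 = Group-."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    true_neg = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both groups must be represented")
    return 0.5 * (true_pos / positives + true_neg / negatives)


def score_logical_feature(presence, actual):
    """Score one k-mer under Equation (3) and Equation (4); keep the better."""
    eq3 = ass(presence, actual)                   # present -> Group+
    eq4 = ass([1 - p for p in presence], actual)  # present -> Group-
    return ("eq3", eq3) if eq3 >= eq4 else ("eq4", eq4)


# Hypothetical presence/absence of one k-mer in eight samples.
presence = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
rule, score = score_logical_feature(presence, labels)
```
]]></preformat>
<p>In this toy example the k-mer is present in three of four Group+ samples and absent from three of four Group− samples, so Equation (3) gives a sensitivity and a specificity of 0.75 each. Flipping every prediction swaps correct and incorrect calls within each group, so the ASS under Equation (4) is always one minus the ASS under Equation (3).</p>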
<p><italic>A single-numerical-feature logistic regression</italic>
 predicts case or control status from a single numerical feature, which is used as the independent variable in a logistic regression. An example of each term above is given in <bold>Supplementary File <xref ref-type="supplementary-material" rid="SM1">S1</xref>
</bold>
.</p>
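<p>A single-numerical-feature logistic regression can be sketched as follows. The text does not state how the regression is fitted, so the batch gradient descent, the standardization of the feature, the learning rate, the 0.5 decision threshold, and all names and data below are assumptions chosen to keep the example dependency-free.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def fit_logistic_1d(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y = 1 | x) = sigmoid(w * z + b) by batch gradient descent.

    z is the standardized feature; standardizing keeps the step size sensible
    for normalized k-mer frequencies, which are typically very small numbers.
    Returns a function that maps a raw feature value to a predicted label.
    """
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    scale = math.sqrt(variance) or 1.0  # avoid division by zero
    zs = [(x - mean) / scale for x in xs]

    w, b = 0.0, 0.0
    n = len(zs)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for z, y in zip(zs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * z + b)))
            grad_w += (p - y) * z
            grad_b += p - y
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n

    def predict(x):
        z = (x - mean) / scale
        p = 1.0 / (1.0 + math.exp(-(w * z + b)))
        return 1 if p >= 0.5 else 0

    return predict


# Hypothetical normalized frequencies of one k-mer; 1 = case, 0 = control.
train_x = [0.0, 0.0, 1e-6, 5e-6, 6e-6, 7e-6]
train_y = [0, 0, 0, 1, 1, 1]
predict = fit_logistic_1d(train_x, train_y)
train_predictions = [predict(x) for x in train_x]
```
]]></preformat>
<p>The predicted labels can then be compared with the true labels to obtain the ASS of the numerical feature.</p>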
</sec>
<sec><title>The Computational Framework to Identify <italic>Group-Specific</italic>
 Sequences</title>
<p>As shown in <bold>Figure <xref ref-type="fig" rid="F1">1</xref>
</bold>
, the computational framework of MetaGO consists of three modules. (1) <italic>Creating a feature vector for each sample</italic>
. The feature vector is composed of the number of occurrences for each <italic>k</italic>
-mer through all reads in one sample. (2) <italic>Feature preprocessing</italic>
. After removing <italic>k</italic>
-mers occurring only once and normalizing <italic>k-</italic>
mer frequencies, the feature vectors of all training samples are integrated into a feature matrix. The <italic>k</italic>
-mers that are absent in most training samples are filtered out. (3) <italic>Identifying group-specific features</italic>
. The logical and numerical features with high discriminant power are selected.</p>
<fig id="F1" position="float"><label>FIGURE 1</label>
<caption><p>The MetaGO framework to identify <italic>group-specific</italic>
 sequences with long <italic>k</italic>
-mer features. The framework is composed of three modules. (1) The feature vector of each metagenomic sample is composed of the frequencies of all <italic>k</italic>
-mers. (2) The <italic>k</italic>
-mers are preprocessed by discarding features occurring only once, normalizing frequencies, integrating the feature matrix, and removing the <italic>k</italic>
-mers absent from most training samples. (3) The features are represented in logical and numerical forms, and the features with high discriminant power are identified as <italic>group-specific</italic>
.</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g001"></graphic>
</fig>
<p>MetaGO was developed on <italic>Apache Spark</italic>
 to reduce computational cost by running in parallel, either on a Hadoop cluster with HDFS or on a stand-alone multi-core server. The open-source pipeline is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VVsmileyx/MetaGO">https://github.com/VVsmileyx/MetaGO</ext-link>
.</p>
</sec>
<sec><title>Module 1: Creating Feature Vectors</title>
<p>A feature vector consists of elements that account for the number of occurrences (i.e., frequency) for each <italic>k</italic>
-mer through all the reads in one metagenomic sample. Existing tools, such as DSK (<xref rid="B35" ref-type="bibr">Rizk et al., 2013</xref>
) or JELLYFISH (<xref rid="B23" ref-type="bibr">Marçais and Kingsford, 2011</xref>
), are available for counting <italic>k</italic>
-mer frequency. In our study, we used DSK to count <italic>k</italic>
-mers. The reverse complements of reads were taken into consideration. A <italic>k</italic>
-mer and its reverse complement were treated as the same object, so the theoretical dimension of a feature vector for one sample is reduced to <inline-formula><mml:math id="M5"><mml:mrow><mml:mfrac><mml:mrow><mml:msup><mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup><mml:mn>2</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
 for even <italic>k</italic>
 and <inline-formula><mml:math id="M6"><mml:mrow><mml:mfrac><mml:mrow><mml:msup><mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
 for odd <italic>k</italic>
. Furthermore, only the <italic>k</italic>
-mers that occur in a sample are stored in the feature vector to reduce storage space.</p>
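<p>As an illustration of the counting scheme described above, the following Python sketch counts canonical <italic>k</italic>-mers and checks the dimension formula by brute force for small <italic>k</italic>. It is not the DSK implementation used in the study. The function names are hypothetical, and choosing the lexicographically smaller string as the representative of a <italic>k</italic>-mer and its reverse complement is an assumption made here for concreteness.</p>

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by the smaller string (assumed convention)."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k):
    """Count canonical k-mers over all reads; only observed k-mers are stored."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if set(kmer).issubset("ACGT"):  # skip windows containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts


def n_canonical(k):
    """Brute-force number of distinct canonical k-mers (feasible for small k only)."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})


counts = count_canonical_kmers(["ACGTAC", "GTACGT"], k=3)
print(dict(counts))
for k in (2, 3, 4):
    # even k: (4^k + 2^k) / 2, odd k: 4^k / 2
    expected = (4 ** k + 2 ** k) // 2 if k % 2 == 0 else 4 ** k // 2
    print(k, n_canonical(k), expected)
```

<p>The extra 2<sup><italic>k</italic></sup> term for even <italic>k</italic> comes from the <italic>k</italic>-mers that equal their own reverse complement, which cannot occur when <italic>k</italic> is odd.</p>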
</sec>
<sec><title>Module 2: Feature Preprocessing</title>
<sec><title>Discard <italic>k-</italic>
mer Features Occurring Only Once</title>
<p>As <italic>k</italic>
-mer length increases, <italic>k</italic>
-mer frequency decreases exponentially, and the <italic>k</italic>
-mer vector becomes highly sparse. A <italic>k</italic>
-mer that occurs only once may reflect low abundance or a sequencing error. To achieve reproducible and stable prediction models, <italic>k</italic>
-mers occurring once were removed from the frequency vector, and this step was implemented by DSK during <italic>k-</italic>
mer counting in our study.</p>
</sec>
<sec><title>Normalize <italic>k</italic>
-mer Frequencies</title>
<p>Owing to different sequencing depths in samples, the frequency of a <italic>k</italic>
-mer is normalized using Equation (1) by the total number of occurrences of all <italic>k</italic>
-mers.</p>
</sec>
<sec><title>Build Feature Matrix Across Training Samples</title>
<p>Feature vectors across all training samples are integrated as a matrix. This step is extremely time- and memory-consuming as a result of the large sample size and the long <italic>k</italic>
-mer length. Because each feature vector stores only non-zero <italic>k</italic>
-mers, the integration requires extensive sorting and matching of <italic>k</italic>
-mers. When <italic>k</italic>
 = 40, approximately 10<sup>9</sup>
<italic>40</italic>
-mer features occur more than once. The feature matrix <italic>F</italic>
 is denoted as Equation (5), where <italic>k</italic>
-mer<sub>1</sub>
, <italic>k</italic>
-mer<sub>2</sub>
, …, <italic>k</italic>
-mer<italic><sub>m</sub>
</italic>
 are the <italic>m k</italic>
-mer features, and <italic>S</italic>
<sub>1</sub>
, <italic>S</italic>
<sub>2</sub>
, …, <italic>S<sub>N</sub>
</italic>
 are the <italic>N</italic>
 training samples from case and control groups.</p>
<disp-formula id="E5"><label>(5)</label>
<mml:math id="M7"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow></mml:mrow>
</mml:mtd>
<mml:mtd><mml:mrow><mml:mtable><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mrow><mml:mi mathvariant="normal"> </mml:mi>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:msub><mml:mi>S</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mi mathvariant="normal"> </mml:mi>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mi mathvariant="normal"> </mml:mi>
<mml:mrow><mml:mi mathvariant="normal"> </mml:mi>
<mml:msub><mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal"> </mml:mi>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mo>⋯</mml:mo>
<mml:mi mathvariant="normal"> </mml:mi>
<mml:mi mathvariant="normal"> </mml:mi>
</mml:mtd>
<mml:mtd columnalign="right"><mml:mrow><mml:mi mathvariant="normal"> </mml:mi>
<mml:msub><mml:mi>S</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mi mathvariant="normal"> </mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mrow><mml:mtable columnalign="right"><mml:mtr columnalign="right"><mml:mtd columnalign="right"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mtable columnalign="left"><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mtable columnalign="right"><mml:mtr><mml:mtd><mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mtext>mer</mml:mtext>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mi>F</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mtext>mer</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd columnalign="center"><mml:mtext></mml:mtext>
<mml:mo>⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:msub><mml:mtext>mer</mml:mtext>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mrow><mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mo>⋯</mml:mo>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mo>⋯</mml:mo>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd><mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd><mml:mo>⋮</mml:mo>
</mml:mtd>
<mml:mtd><mml:mo>⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd><mml:mo>⋯</mml:mo>
</mml:mtd>
<mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
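<p>The integration step can be sketched in a few lines of Python for data that fit in memory. This is only an illustration of the layout of <italic>F</italic> in Equation (5), with rows for <italic>k</italic>-mers and columns for samples. The MetaGO implementation on Spark is not reproduced here, and the function name and toy values are hypothetical.</p>

```python
def build_feature_matrix(sample_vectors):
    """Merge sparse per-sample vectors into a dense matrix F.

    sample_vectors: list of N dicts mapping k-mer to normalized frequency.
    Returns (kmers, F), where F[i][j] is the frequency of kmers[i] in sample j.
    A k-mer that a sample does not store is filled in as 0.0.
    """
    kmers = sorted(set().union(*sample_vectors))
    matrix = [[vec.get(kmer, 0.0) for vec in sample_vectors] for kmer in kmers]
    return kmers, matrix


samples = [
    {"ACG": 0.5, "GTA": 0.5},
    {"ACG": 0.25, "CCA": 0.75},
    {"GTA": 1.0},
]
kmers, F = build_feature_matrix(samples)
for kmer, row in zip(kmers, F):
    print(kmer, row)
```

<p>Taking the union of keys and then looking up every sample is the matching work that becomes expensive for long <italic>k</italic>-mers and many samples.</p>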
</sec>
<sec><title>Remove Highly-Sparse Features</title>
<p>A “highly sparse” feature is a <italic>k</italic>
-mer that is absent from most training samples, i.e., its frequency is 0 in most training cases and most training controls. Such features contribute little to classification. In our study, a <italic>k-</italic>
mer was removed if it was absent from more than 80% of control samples and from more than 80% of case samples. The stringent threshold of 80% offers high confidence that only less useful features are filtered out.</p>
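<p>A minimal Python sketch of this filter is given below, assuming that both groups are non-empty and that one row of the feature matrix is available as a list of frequencies. The function name, the parameter <italic>max_absent</italic>, and the toy rows are hypothetical.</p>

```python
def is_highly_sparse(row, is_case, max_absent=0.8):
    """True if the k-mer is absent (frequency 0) from more than max_absent
    of the case samples and also from more than max_absent of the controls."""
    case = [f for f, c in zip(row, is_case) if c]
    ctrl = [f for f, c in zip(row, is_case) if not c]
    absent_case = sum(f == 0 for f in case) / len(case)
    absent_ctrl = sum(f == 0 for f in ctrl) / len(ctrl)
    return absent_case > max_absent and absent_ctrl > max_absent


is_case = [True] * 10 + [False] * 10
rare = [0.1] + [0.0] * 9 + [0.2] + [0.0] * 9                 # present in 1 case and 1 control
case_only = [0.1] * 5 + [0.0] * 5 + [0.0] * 10                # present in 5 cases, no controls
borderline = [0.1] * 2 + [0.0] * 8 + [0.1] * 2 + [0.0] * 8    # absent from exactly 80% of each group

for name, row in [("rare", rare), ("case_only", case_only), ("borderline", borderline)]:
    print(name, is_highly_sparse(row, is_case))
```

<p>Because both conditions must hold, a <italic>k</italic>-mer that is common in only one group is kept, which is the kind of feature the later modules look for.</p>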
</sec>
</sec>
<sec><title>Module 3: Identifying <italic>Group-Specific</italic>
 Features</title>
<p>After preprocessing, about 10<sup>6</sup>
 features still remain for <italic>40</italic>
-mers. For this many features, simple feature-ranking filters are more suitable than wrapper feature selection. Wrapper methods treat feature selection as a search problem in which different feature combinations are prepared, evaluated, and compared with one another, and the space of combinations is far too large to search for the number of features in our study. The filtering of <italic>k</italic>
-mers uses only the training data and never touches the validation or testing data.</p>
<sec><title>Identify <italic>Group-Specific</italic>
 Logical Features Based on a <italic>Single-Logical-Feature</italic>
 Predictor</title>
<p>As shown in <bold>Figure <xref ref-type="fig" rid="F2">2</xref>
</bold>
, numerical features were transformed to logical features using Equation (2), and the <italic>single-logical-feature</italic>
 predictors were created according to Equation (3) or (4). The performance of a predictor was evaluated by ASS, the average of sensitivity and specificity. If a <italic>single-logical-feature</italic>
 predictor achieves ASS ≥ 𝜃<sub>1</sub>
, the corresponding <italic>k</italic>
-mer is identified as <italic>group-specific</italic>
. The <italic>group-specific</italic>
 logical features are present in one group but absent from the other group.</p>
<fig id="F2" position="float"><label>FIGURE 2</label>
<caption><p>The <italic>single-logical-feature</italic>
 predictor. The numerical feature is transformed into the logical feature. Based on the logical value of the feature, the <italic>single-logical-feature</italic>
 predictor is designed, and the corresponding ASS is calculated.</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g002"></graphic>
</fig>
<p>In our study, <italic>𝜃</italic>
<sub>1</sub>
 was set to 0.80, which means that each <italic>group-specific k</italic>
-mer alone can separate the two groups of training samples with ASS ≥ 0.80. Some researchers may prefer a statistical test, such as the chi-squared test, to rank the features. To accommodate this preference, we calculated <italic>p</italic>
-values of the chi-squared test for the same feature set. Of the two feature lists containing the 400 largest ASS values and the 400 smallest <italic>p</italic>
-values, 392 features appeared in both lists and in the same order. ASS and the chi-squared test therefore rank the features consistently. In our pipeline, users can choose either ASS or the chi-squared test as the evaluation metric.</p>
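<p>The ASS of a <italic>single-logical-feature</italic> predictor can be sketched in Python as follows. The sketch assumes that the logical transformation of Equation (2) marks a <italic>k</italic>-mer as present when its frequency is greater than zero, and that both groups are non-empty. The function names and the toy frequencies are hypothetical.</p>

```python
def ass(predicted, actual):
    """Average of sensitivity and specificity; True means Group+."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    n_plus = sum(actual)
    n_minus = len(actual) - n_plus
    return (tp / n_plus + tn / n_minus) / 2


def single_logical_feature_ass(frequencies, in_group_plus):
    """Best ASS over the two predictors of Equations (3) and (4)."""
    present = [f > 0 for f in frequencies]  # assumed form of Equation (2)
    direct = ass(present, in_group_plus)                     # presence predicts Group+
    inverse = ass([not p for p in present], in_group_plus)   # presence predicts Group-
    return max(direct, inverse)


in_group_plus = [True] * 5 + [False] * 5
frequencies = [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.05]
score = single_logical_feature_ass(frequencies, in_group_plus)
print(score, score >= 0.8)
```

<p>The two predictors are complementary: swapping the predicted groups turns an ASS of <italic>a</italic> into 1 − <italic>a</italic>, so taking the larger of the two values covers features that mark either group.</p>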
</sec>
<sec><title>Identify <italic>Group-Specific</italic>
 Numerical Features Based on a <italic>Single-Numerical-Feature</italic>
 Logistic-Regression Predictor</title>
<p>First, the Wilcoxon rank-sum test is applied to the numerical features to select <italic>k</italic>
-mers with differential abundance (<italic>p</italic>
-value ≤ 𝜃<sub>2</sub>
) between the two groups. However, our main goal is to identify the features with the most discriminant power. Therefore, for each numerical <italic>k</italic>
-mer feature that passed the Wilcoxon rank-sum test, we fit a logistic regression over all the training samples; we term this the <italic>single-numerical-feature</italic>
 logistic-regression predictor. We used ASS ≥ 𝜃<sub>3</sub>
 as a metric to identify <italic>group-specific</italic>
 numerical <italic>k</italic>
-mers. In our study, we used 𝜃<sub>2</sub>
 = 0.01 and 𝜃<sub>3</sub>
 = 0.8.</p>
</sec>
<sec><title>Random Forests Prediction of Disease Status With the Combination of Multiple Features</title>
<p>The <italic>single-logical-feature</italic>
 predictor and the <italic>single-numerical-feature</italic>
 logistic-regression predictor are classifiers based on a single <italic>k</italic>
-mer feature. Because of the complicated association between the human microbiome and disease, classifiers using multiple features are expected to be more effective than those using a single feature. Therefore, we used random forests to build a classifier from multiple <italic>group-specific</italic>
 features. To remove redundant features, we calculated the Pearson correlation coefficients (PCC) between the feature vectors of every pair of <italic>k-</italic>
mers. If a pair of <italic>k</italic>
-mers had a PCC higher than a preset threshold, such as 0.75, one <italic>k</italic>
-mer of the pair was randomly discarded. The remaining features were ranked according to the variable importance measures of Breiman’s random forests method (<xref rid="B4" ref-type="bibr">Breiman, 2001</xref>
), and the top features were adopted to design a random forests classifier.</p>
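<p>The redundancy filter can be illustrated with the Python sketch below. It is a simplified, deterministic variant rather than the procedure used in the study: instead of randomly discarding one member of each correlated pair, it scans the features in order and keeps a feature only if its PCC with every feature already kept does not exceed the threshold. The function names and the toy rows are hypothetical.</p>

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / (sx * sy)


def drop_redundant(rows, threshold=0.75):
    """Return indices of the features kept (greedy, keeps the earlier feature)."""
    kept = []
    for i, row in enumerate(rows):
        if not any(pearson(row, rows[j]) > threshold for j in kept):
            kept.append(i)
    return kept


rows = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.0],   # nearly proportional to the first row
    [4.0, 1.0, 3.0, 2.0],   # weakly (negatively) correlated with the first row
]
print(drop_redundant(rows))
```

<p>In this toy input the second feature is dropped because it is almost a rescaled copy of the first, while the third is retained.</p>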
</sec>
<sec><title>Assembly of <italic>Group-Specific</italic>
 Sequences</title>
<p>Using CAP3 (<xref rid="B11" ref-type="bibr">Huang and Madan, 1999</xref>
), the identified <italic>group-specific k</italic>
-mers based on logical features and those based on numerical features were assembled separately into longer sequences. For quality control, only assembled sequences longer than a given threshold (200 bp in our study) were considered <italic>group-specific</italic>
 sequences.</p>
</sec>
</sec>
<sec><title>Parallel Computing Workflow on <italic>Apache Spark</italic>
</title>
<p>The running time and memory required to integrate the feature matrix and filter out less useful features grow dramatically with <italic>k</italic>
-mer length and sample size. Fortunately, these processing steps are well suited to parallel computing. Therefore, we developed the MetaGO workflow on <italic>Apache Spark</italic>
 (<xref rid="B48" ref-type="bibr">Zaharia et al., 2010</xref>
) to implement parallel computing. <italic>Spark</italic>
 can run in local mode or cluster mode. Thus, MetaGO can run on a local stand-alone multi-core server or a distributed cluster on HDFS. The detailed description of the workflow is given in <bold>Supplementary File <xref ref-type="supplementary-material" rid="SM1">S1</xref>
</bold>
. The workflow is available on <ext-link ext-link-type="uri" xlink:href="https://github.com/VVsmileyx/MetaGO">https://github.com/VVsmileyx/MetaGO</ext-link>
.</p>
</sec>
<sec><title>Experimental Design</title>
<sec><title>The Setting of <italic>k</italic>
-mer Length</title>
<p>A previous study showed that sufficiently long <italic>k</italic>
-mers are usually specific to a genome (<xref rid="B8" ref-type="bibr">Fofanov et al., 2004</xref>
). According to an observation based on 100 pairs of bacterial genomes, the average ratio of common <italic>k</italic>
-mers between the genomes is <1.02% when <italic>k</italic>
 ≥ 30 (<xref rid="B17" ref-type="bibr">Le et al., 2015</xref>
). Therefore, <italic>k</italic>
-mers longer than 30 bp should be sensitive enough to capture the discriminative characteristics that separate two groups; thus, in theory, longer <italic>k</italic>
-mers are better suited to this task. At the same time, however, <italic>k</italic>
-mer length is limited by four factors: sample size (the number of samples), sequencing depth, computational cost, and read length. First, the dimension of feature space grows exponentially with <italic>k</italic>
. Owing to the curse of dimensionality, a limited number of samples leads to a high false-positive rate, so a large sample size is required to obtain high specificity. Second, when sequencing is not deep enough to cover all metagenomic regions, the frequencies of long <italic>k</italic>
-mers are inaccurate. Third, as <italic>k</italic>
-mer length increases, the huge number of <italic>k</italic>
-mers causes memory and storage requirements to explode. Fourth, when the <italic>k</italic>
-mer length is close to read length, the frequencies of <italic>k</italic>
-mers are contaminated by the truncated sites under limited sequencing depth. Therefore, we set the <italic>k</italic>
-mer length to 30–40 as a reasonable tradeoff among sensitivity, specificity, and computational cost.</p>
</sec>
<sec><title>Simulated Metagenomic Dataset</title>
<p>Based on the relative abundances of frequent microbial genomes within human gut analyzed by <xref rid="B29" ref-type="bibr">Qin et al. (2010)</xref>
 (Figure 3 of their paper), we selected the top 10 most frequent genomes as the basis components of the simulation. The relative abundances in the control group were approximated from the medians of Figure 3 of that study (<xref rid="B29" ref-type="bibr">Qin et al., 2010</xref>
), which were converted into the cell proportions of the 10 genomes in all the cells within the community. In addition, we added another strain <italic>Bacteroides thetaiotaomicron</italic>
 VPI-5482 to the patient group; this strain shares about 87% common sequences with the existing <italic>B. thetaiotaomicron</italic>
 7330. Meanwhile, we assigned <italic>Bacteroides caccae</italic>
 ATCC 43185 an abundance in the control group three times that in the patient group. The remaining nine genomes have identical abundance distributions in the healthy control and patient groups. The detailed settings are shown in <bold>Table <xref ref-type="table" rid="T1">1</xref>
</bold>
. We used MetaSim (<xref rid="B34" ref-type="bibr">Richter et al., 2008</xref>
) to generate 15 metagenomic samples each for the case and control groups. For each group, the absolute values of Gaussian noise with mean zero and standard deviation equal to each central relative abundance were added to the central relative abundance vector. Each sample contains ∼10,000,000 reads. In the evaluations, the proportion of identified <italic>group-specific k</italic>
-mers that can be aligned to disease-specific sequence regions is called “precision,” and the proportion of disease-specific sequence regions that can be covered by <italic>group-specific 40</italic>
-mers is called “recall.”</p>
<table-wrap id="T1" position="float"><label>Table 1</label>
<caption><p>The relative abundance profile of different genomes in control and patient groups for the simulated dataset.</p>
</caption>
<table frame="hsides" rules="groups" cellspacing="5" cellpadding="5"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">Genomes</th>
<th valign="top" align="left" rowspan="1" colspan="1">NCBI Accession ID</th>
<th valign="top" align="center" rowspan="1" colspan="1">Relative_Abundance_H<sup>∗</sup>
</th>
<th valign="top" align="center" rowspan="1" colspan="1"></th>
<th valign="top" align="center" rowspan="1" colspan="1">Relative_Abundance_P<sup>∗</sup>
</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacteroides thetaiotaomicron</italic>
 7330</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_CP012937.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">18%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacteroides thetaiotaomicron</italic>
 VPI-5482</td>
<td valign="top" align="left" rowspan="1" colspan="1">NC_004663.1</td>
<td valign="top" align="center" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="center" rowspan="1" colspan="1">6%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacteroides uniformis</italic>
 CL03T12C37</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_JH724268.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">7%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Alistipes putredinis</italic>
 isolate CAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">MNQH01000001.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">16%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Parabacteroides merdae</italic>
 2789STDY5834848</td>
<td valign="top" align="left" rowspan="1" colspan="1">CZAG01000002.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">10%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Dorea longicatena</italic>
 2789STDY5834914</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_CZAY01000001.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">10%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Ruminococcus bromii</italic>
 L2-63</td>
<td valign="top" align="left" rowspan="1" colspan="1">FP929051.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">10%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacteroides caccae</italic>
 ATCC 43185</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_CP022412.2</td>
<td valign="top" align="center" rowspan="1" colspan="1">9%</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="center" rowspan="1" colspan="1">3%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Clostridium</italic>
 sp. SS2/1</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_DS547029.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">8%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Eubacterium hallii</italic>
 isolate EH1</td>
<td valign="top" align="left" rowspan="1" colspan="1">NZ_LT907978.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">6%</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Ruminococcus torques</italic>
 L2-14</td>
<td valign="top" align="left" rowspan="1" colspan="1">FP929055.1</td>
<td valign="top" align="center" colspan="3" rowspan="1">6%</td>
</tr>
</tbody>
</table>
<table-wrap-foot><attrib><italic>The relative abundances are the proportions of genome copies contributed by each of the 11 genomes within the community. Bacteroides thetaiotaomicron VPI-5482, another strain of B. thetaiotaomicron, is present only in the patient group. Bacteroides caccae ATCC 43185 is three times as abundant in the control group as in the patient group. <sup>∗</sup>
H, healthy control; P, patient.</italic>
</attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec><title>Metagenomic Liver Cirrhosis-Associated Dataset</title>
<p>In recent studies, alterations in human gut microbiota have been linked to LC (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
; <xref rid="B44" ref-type="bibr">Wiest et al., 2014</xref>
). We analyzed the human fecal metagenomic samples (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
) from 98 LC patients and 83 healthy controls, as well as an extra dataset composed of 25 independent patients and 31 controls. The data were sequenced with Illumina HiSeq 2000. In the experiment, two-thirds of the 98 patients and 83 control samples were randomly selected as the training set to identify <italic>group-specific k</italic>
-mers, and the remaining one-third served as the validation set. Finally, the extra 25 patients and 31 controls were used to test the <italic>group-specific k</italic>
-mers independently.</p>
</sec>
<sec><title>Metagenomic IBD-Associated and WT2D-Associated Datasets</title>
<p>The IBD dataset is composed of the human fecal metagenomic samples from 25 IBD patients and 97 controls (<xref rid="B29" ref-type="bibr">Qin et al., 2010</xref>
). These samples were sequenced on Illumina GAIIx from the MetaHIT project (<xref rid="B12" ref-type="bibr">Human Microbiome Project Consortium, 2012</xref>
). The WT2D dataset is composed of samples from 53 T2D patients and 43 healthy controls from European women (<xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
). These samples were sequenced on Illumina HiSeq 2000. Both datasets have previously been used for disease prediction with various types of features (<xref rid="B6" ref-type="bibr">Cui and Zhang, 2013</xref>
; <xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
; <xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
). In our study, we adopted the experimental setting of a previous study (<xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
), in which 20 independent runs of 10-fold cross-validation were used to evaluate the classification.</p>
</sec>
</sec>
</sec>
<sec><title>Results</title>
<sec><title>The Simulated Metagenomic Dataset</title>
<p>For logical features, there were 1,646,128 <italic>group-specific 40</italic>
-mers using ASS ≥ 0.8 as the threshold. Of these, 99.999% of the <italic>40</italic>
-mers were patient-specific, meaning that almost all the logical group-specific <italic>40</italic>
-mers exist only in the patient group and are absent from the healthy control group. Among the logical <italic>patient-specific 40</italic>
-mers, 99.11% of them (precision) were exactly aligned to strain <italic>B. thetaiotaomicron</italic>
 VPI-5482 (the strain present in the patient group only) and covered 98.89% (recall) of the regions that are not in the genome of the other strain <italic>B. thetaiotaomicron</italic>
 7330. None of the <italic>group-specific</italic>
 40-mers were aligned to <italic>B. thetaiotaomicron</italic>
 7330, which has the same abundance in both groups. The logical <italic>group-specific 40</italic>
-mers mainly indicate genomes present in one group but not in the other group.</p>
<p>The remaining features were represented as numerical <italic>40</italic>
-mers, and there were 7,891,412 <italic>group-specific 40</italic>
-mers using <italic>p</italic>
 < 0.05 and ASS ≥ 0.8 as the thresholds. And 4,452,553 (56.42%) of them were exactly matched to <italic>B. caccae</italic>
 ATCC 43185 and covered 99.61% (recall) of the whole genome, which is differentially abundant between the healthy control and the case groups. Among the remaining <italic>40</italic>
-mers, 3,257,251 (41.3%) of them were aligned to the common regions between <italic>B. thetaiotaomicron</italic>
 VPI-5482 and <italic>B. thetaiotaomicron</italic>
 7330, and covered 96.39% (recall) of the common sequences. These common sequences are differentially abundant because, in the patient group, both VPI-5482 and <italic>B. thetaiotaomicron</italic>
 7330 contribute to their abundance, whereas in the control group only <italic>B. thetaiotaomicron</italic>
 7330 does. In total, 97.72% (precision) of the identified <italic>group-specific</italic>
 numerical <italic>40</italic>
-mers were aligned to the differentially abundant regions between the two groups.</p>
<p>The identified <italic>patient-specific</italic>
 and <italic>control-specific 40</italic>
-mers from logical and numerical features were assembled into contigs separately. Among the assembled <italic>patient-specific</italic>
 contigs, 20 had length ≥10,000 bp, and all of these were matched to the <italic>patient-specific</italic>
 strain <italic>B. thetaiotaomicron</italic>
 VPI-5482 with 99.79–100% identity and 100% coverage. Coverage here means the proportion of the contig sequence mapped to the strain. In contrast, these contigs could not be matched to <italic>B. thetaiotaomicron</italic>
 7330, and the longest common sequences between the contigs and the <italic>B. thetaiotaomicron</italic>
 7330 genome were no longer than 47 bp. Among the assembled <italic>control-specific</italic>
 contigs, 24 had length ≥5000 bp, and all of them were mapped to the differentially abundant genome <italic>B. caccae</italic>
 with 100% identity and 100% coverage using BLAST (<xref rid="B2" ref-type="bibr">Altschul et al., 1997</xref>
).</p>
<p>To evaluate the effect of <italic>k</italic>
-mer length, we ran MetaGO on <italic>10</italic>
-mer, <italic>20</italic>
-mer, <italic>30</italic>
-mer, <italic>50</italic>
-mer, and <italic>60</italic>
-mer, and the corresponding precision and recall are shown in <bold>Table <xref ref-type="table" rid="T2">2</xref>
</bold>
. For the simulated dataset, when <italic>k</italic>
 = 10, no <italic>group-specific</italic>
 logical <italic>k</italic>
-mers were identified. The recall rates for the identified numerical <italic>k</italic>
-mers were only 25.34% for <italic>B. caccae</italic>
 ATCC 43185 and 22.45% for the common regions between <italic>B. thetaiotaomicron</italic>
 VPI-5482 and <italic>B. thetaiotaomicron</italic>
 7330. When <italic>k</italic>
 ≥ 20, the effects of the <italic>k</italic>
-mer length on the performance of our methods were small. The precision increased slightly with the <italic>k</italic>
-mer length from 99.03 to 99.35% for logical <italic>k</italic>
-mers and from 96.81 to 98.58% for numerical <italic>k</italic>
-mers, consistent with the intuition that long <italic>k</italic>
-mers can capture more specific information about each group. On the other hand, although almost all recall rates were above 90%, the recall first increased with <italic>k</italic>
-mer length until <italic>k</italic>
 = 40 and then decreased, which might be caused by insufficient coverage for long <italic>k</italic>
-mers.</p>
<table-wrap id="T2" position="float"><label>Table 2</label>
<caption><p>The precision and recall of MetaGO for the simulated dataset using different <italic>k</italic>
-mer lengths.</p>
</caption>
<table frame="hsides" rules="groups" cellspacing="5" cellpadding="5"><thead><tr><th valign="top" align="left" colspan="2" rowspan="1"><italic>k</italic>
-mer length</th>
<th valign="top" align="center" rowspan="1" colspan="1">10 (%)</th>
<th valign="top" align="center" rowspan="1" colspan="1">20 (%)</th>
<th valign="top" align="center" rowspan="1" colspan="1">30 (%)</th>
<th valign="top" align="center" rowspan="1" colspan="1">40 (%)</th>
<th valign="top" align="center" rowspan="1" colspan="1">50 (%)</th>
<th valign="top" align="center" rowspan="1" colspan="1">60 (%)</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">Logicalized <italic>k</italic>
-mers</td>
<td valign="top" align="left" rowspan="1" colspan="1">Precision</td>
<td valign="top" align="center" rowspan="1" colspan="1">–<sup>∗</sup>
</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.03</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.05</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.11</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.45</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.35</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">Recall</td>
<td valign="top" align="center" rowspan="1" colspan="1">–<sup>∗</sup>
</td>
<td valign="top" align="center" rowspan="1" colspan="1">89.79</td>
<td valign="top" align="center" rowspan="1" colspan="1">92.16</td>
<td valign="top" align="center" rowspan="1" colspan="1">98.89</td>
<td valign="top" align="center" rowspan="1" colspan="1">97.01</td>
<td valign="top" align="center" rowspan="1" colspan="1">95.23</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Numerical <italic>k</italic>
-mers</td>
<td valign="top" align="left" rowspan="1" colspan="1">Precision</td>
<td valign="top" align="center" rowspan="1" colspan="1">99.63</td>
<td valign="top" align="center" rowspan="1" colspan="1">96.81</td>
<td valign="top" align="center" rowspan="1" colspan="1">96.07</td>
<td valign="top" align="center" rowspan="1" colspan="1">97.72</td>
<td valign="top" align="center" rowspan="1" colspan="1">98.22</td>
<td valign="top" align="center" rowspan="1" colspan="1">98.58</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">Averaged recall</td>
<td valign="top" align="center" rowspan="1" colspan="1">23.89</td>
<td valign="top" align="center" rowspan="1" colspan="1">95.70</td>
<td valign="top" align="center" rowspan="1" colspan="1">97.93</td>
<td valign="top" align="center" rowspan="1" colspan="1">98.00</td>
<td valign="top" align="center" rowspan="1" colspan="1">96.82</td>
<td valign="top" align="center" rowspan="1" colspan="1">94.76</td>
</tr>
</tbody>
</table>
<table-wrap-foot><attrib><italic>The “averaged recall” in numerical k-mers is the average of the recall of B. caccae ATCC 43185 genome and the recall of the common regions between strain B. thetaiotaomicron 7330 and B. thetaiotaomicron VPI-5482. <sup>∗</sup>
When k = 10, there is no logicalized k-mer identified, so it is marked with “–”.</italic>
</attrib>
</table-wrap-foot>
</table-wrap>
<p>The experimental results demonstrate that the identification of <italic>group-specific 40</italic>
-mers can not only capture genomes with different abundance but also identify <italic>group-specific</italic>
 markers under the strain-level resolution. Even though the two strains <italic>B. thetaiotaomicron</italic>
 VPI-5482 and <italic>B. thetaiotaomicron</italic>
 7330 share 87% common sequences, our method still captured the <italic>group-specific</italic>
 sequences.</p>
</sec>
<sec><title>The LC-Associated Metagenomic Dataset</title>
<p>MetaGO was applied to the large-scale metagenomic LC-associated dataset (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
). With sufficient training samples and long read length, the <italic>k</italic>
-mer length was set to <italic>k</italic>
 = 40. A total of ∼10<sup>9</sup>
 non-zero <italic>40</italic>
-mers were found in the feature matrix of training samples. After removing the highly sparse <italic>40-</italic>
mer features, ∼10<sup>6</sup>
 features were left.</p>
<sec><title>Identify <italic>Group-Specific</italic>
 Features</title>
<p>Using ASS > 0.8 as the threshold, 37,302 logical features were identified as <italic>group-specific 40-</italic>
mers. That is, any one of these <italic>40</italic>
-mers could achieve ASS > 0.8 using its corresponding <italic>single-logical-feature</italic>
 predictor on training samples. We then used each of these 37,302 <italic>single-logical-feature</italic>
 predictors to predict LC in the validation and testing sets. As shown in the histogram of <bold>Figure <xref ref-type="fig" rid="F3">3A</xref>
</bold>
, ASS values of validation and testing were centered at 0.85 and 0.78, respectively. Among the 37,302 <italic>single-logical-feature</italic>
 predictors, 35,404 (95%) <italic>group-specific 40-</italic>
mers achieved ASS ≥ 0.8 on the validation set, and 12,750 (36%) achieved ASS ≥ 0.8 on the testing set. Furthermore, 345 numerical features were identified as <italic>group-specific 40</italic>
-mers with ASS ≥ 0.8, where 248 and 194, respectively, achieved ASS ≥ 0.8 on validation and testing sets using corresponding <italic>single-numerical-feature</italic>
 logistic regression predictors. All 37,302 logical and 345 numerical <italic>40</italic>
-mers were <italic>LC-specific</italic>
 in that they were all present only in the fecal samples of LC patients, but not in the samples from healthy controls. The identified <italic>group-specific 40</italic>
-mers for the LC dataset are available in <bold>Supplementary File <xref ref-type="supplementary-material" rid="SM2">S2</xref>
</bold>
.</p>
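<p>As a minimal illustration of the ASS criterion used above, the following Python sketch computes the average of sensitivity and specificity for one presence/absence feature. It assumes that a single-logical-feature predictor labels a sample as a patient exactly when the k-mer is present. The function names and the toy data are hypothetical and are not taken from the MetaGO code.</p>

```python
def single_logical_feature_predict(presence):
    """Assumed rule: predict patient (1) when the k-mer is present, else control (0)."""
    return [1 if present else 0 for present in presence]


def average_sensitivity_specificity(y_true, y_pred):
    """ASS = (sensitivity + specificity) / 2 for binary labels (1 = patient)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy data: four patients (1) and four controls (0); not real samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
presence = [True, True, True, False, False, False, False, True]

predictions = single_logical_feature_predict(presence)
ass_value = average_sensitivity_specificity(labels, predictions)
print(ass_value)
```

<p>Under this assumed rule, a feature would be retained as group-specific when its ASS on the training samples exceeds the chosen threshold.</p>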
<fig id="F3" position="float"><label>FIGURE 3</label>
<caption><p><bold>(A)</bold>
 The distribution of ASS values of the 37,302 <italic>single-logical-feature</italic>
 predictors and 345 <italic>single-numerical</italic>
 logistic-regression predictors on the identified <italic>group-specific</italic>
 features for training, validation, and testing sets. These predictors achieved better performance in the validation set compared to the training set. A total of 35,652 <italic>group-specific</italic>
 features achieved ASS ≥ 0.8 for the validation set, and 12,944 of them achieved ASS ≥ 0.8 for the testing set. <bold>(B)</bold>
 ROC curves of the random forests classifier with the top 10 features on validation and testing sets. Using the top 10 <italic>group-specific 40</italic>
-mers, the random forests classifier achieved AUC of 0.963, 0.969, and 0.942 on training, validation, and testing sets, respectively.</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g003"></graphic>
</fig>
<p>We also ran a control experiment in which the labels of the training samples were randomly shuffled. Using the same pipeline and settings, only 247 <italic>40</italic>
-mers achieved ASS ≥ 0.7, and the highest value was 0.73. This control indicates that most of the identified <italic>group-specific 40</italic>
-mers for LC are likely to reflect true associations rather than false positives.</p>
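<p>The label-shuffling control can be sketched as follows. This is an illustrative Python example under stated assumptions, not the MetaGO implementation: each feature is a 0/1 presence vector, a sample is predicted to be a patient when the k-mer is present, and the helper names, the seed, and the toy data are hypothetical.</p>

```python
import random


def ass(labels, presence):
    """Average of sensitivity and specificity, predicting patient (1) when present."""
    pairs = list(zip(labels, presence))
    tp = sum(1 for y, x in pairs if y == 1 and x == 1)
    tn = sum(1 for y, x in pairs if y == 0 and x == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return (tp / positives + tn / negatives) / 2


def best_ass(labels, features):
    """Highest ASS over all features (each feature is one 0/1 vector over samples)."""
    return max(ass(labels, column) for column in features)


def shuffled_label_control(labels, features, seed=0):
    """Best ASS after randomly permuting the labels (the seed is illustrative)."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return best_ass(shuffled, features)


# Toy data: four patients (1) and four controls (0); not real samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly group-specific toy feature
    [1, 0, 1, 0, 1, 0, 1, 0],  # uninformative toy feature
]
observed = best_ass(labels, features)
control = shuffled_label_control(labels, features)
print(observed, control)
```

<p>Comparing the best score under shuffled labels with the best score under the true labels gives a rough sense of how high ASS can rise by chance alone.</p>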
</sec>
<sec><title>Classification With the <italic>Group-Specific</italic>
 40-mer(s)</title>
<p>We used classification performance to evaluate the discriminative capability of the identified <italic>group-specific 40</italic>
-mers. First, we classified the healthy and LC groups with single features. The <italic>single-logical-feature</italic>
 predictor that obtained the highest ASS = 0.87 on the training set achieved ASS = 0.885 (sensitivity = 0.81 and specificity = 0.96) on the validation set and 0.87 (sensitivity = 0.84 and specificity = 0.90) on the independent testing set. Second, we built a classifier using a set of features. Using the top 10 <italic>group-specific 40</italic>
-mers, a random forests classifier achieved AUCs of 0.963 on training, 0.969 on validation, and 0.942 on testing sets, respectively. The corresponding ROC curves are shown in <bold>Figure <xref ref-type="fig" rid="F3">3B</xref>
</bold>
. As shown in <bold>Table <xref ref-type="table" rid="T3">3</xref>
</bold>
, <xref rid="B31" ref-type="bibr">Qin et al. (2014)</xref>
 obtained AUC = 0.918, 0.838, and 0.836 on training, validation, and testing sets with SVM using 15 marker genes as features. <xref rid="B27" ref-type="bibr">Pasolli et al. (2016)</xref>
 obtained AUC = 0.946 ± 0.036 with random forests using 542 species-abundance features and 0.963 ± 0.027 with SVM using 91,756 strain-specific marker features over 20 independent runs of 10-fold cross-validation; cross-validation tends to give more optimistic results than an independent testing set, and many more features were used. These results show that <italic>group-specific 40</italic>
-mers achieved better classification performance with fewer features.</p>
<table-wrap id="T3" position="float"><label>Table 3</label>
<caption><p>Comparison of the prediction performance of different methods based on the LC dataset.</p>
</caption>
<table frame="hsides" rules="groups" cellspacing="5" cellpadding="5"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">Feature</th>
<th valign="top" align="left" rowspan="1" colspan="1"></th>
<th valign="top" align="left" rowspan="1" colspan="1"><italic>40</italic>
-mer</th>
<th valign="top" align="left" rowspan="1" colspan="1"><italic>40</italic>
-mer</th>
<th valign="top" align="left" rowspan="1" colspan="1">Gene markers<sup>††</sup>
</th>
<th valign="top" align="left" rowspan="1" colspan="1">Species abundance<sup>†</sup>
</th>
<th valign="top" align="left" rowspan="1" colspan="1">Presence of strain-specific markers<sup>†</sup>
</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" colspan="2" rowspan="1">Experiment</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="center" rowspan="1" colspan="1">Training (66P+56H)<break></break>
Validation (32P+27H)<break></break>
Testing (25P+31H)</td>
<td valign="top" align="center" rowspan="1" colspan="1"></td>
<td valign="top" align="center" colspan="2" rowspan="1">20 runs of 10-fold<break></break>
cross-validation (114P+118H)</td>
</tr>
<tr><td valign="top" align="left" colspan="2" rowspan="1">Number of features</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>1</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>10</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">15</td>
<td valign="top" align="left" rowspan="1" colspan="1">542</td>
<td valign="top" align="left" rowspan="1" colspan="1">120553</td>
</tr>
<tr><td valign="top" align="left" colspan="2" rowspan="1">Classifier</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Single</bold>
<break></break>
<bold>logical</bold>
<break></break>
<bold>feature</bold>
<break></break>
<bold>predictor</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Random</bold>
<break></break>
<bold>forests</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
<td valign="top" align="left" rowspan="1" colspan="1">Random<break></break>
forests</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AUC</td>
<td valign="top" align="left" rowspan="1" colspan="1">Training<break></break>
validation<break></break>
testing</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>ASS<sup>∗</sup>
 = 0.87</bold>
<break></break>
<bold>ASS = 0.885</bold>
<break></break>
<bold>ASS = 0.87</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>0.963</bold>
<break></break>
<bold>0.969</bold>
<break></break>
<bold>0.942</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.918<break></break>
0.838<break></break>
0.836</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.946 ± 0.035</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.963 ± 0.027</td>
</tr>
</tbody>
</table>
<table-wrap-foot><attrib><italic>Using far fewer features, MetaGO achieved better results than the other methods. The results of MetaGO are in bold. <sup>†</sup>
(<xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
); <sup>††</sup>
(<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
); <sup>∗</sup>
average of sensitivity and specificity.</italic>
</attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec><title><italic>Group-Specific</italic>
 Sequences</title>
<p>The identified <italic>group-specific 40</italic>
-mers were assembled into <italic>group-specific</italic>
 sequences using CAP3 (<xref rid="B11" ref-type="bibr">Huang and Madan, 1999</xref>
), in which 11 assembled sequences were longer than 200 bp, with length from 210 to 350 bp (available in <bold>Supplementary File <xref ref-type="supplementary-material" rid="SM2">S2</xref>
</bold>
). The sequencing reads from the training, validation, and independent testing sets were aligned to these sequences. The coverage distributions over the 11 sequences across all samples are shown as heatmaps in <bold>Figure <xref ref-type="fig" rid="F4">4</xref>
</bold>
. The two groups differ noticeably. In the healthy group, the reads of most samples could not be aligned to the 11 sequences, whereas in the patient group, reads from most patients aligned to the 11 sequences. The <italic>de novo</italic>
 and reference-free assembly produces longer <italic>group-specific</italic>
 sequences, which enables the discovery of biomarkers.</p>
<fig id="F4" position="float"><label>FIGURE 4</label>
<caption><p>Heatmaps of coverage distribution over the 11 assembled sequences by the metagenomic reads from the training, validation, and testing samples. <bold>(A)</bold>
 Heatmap of the reads coverage of the 11 assembled sequences across the training and validation samples (83 healthy individuals and 98 LC patients). <bold>(B)</bold>
 Heatmap of the reads coverage of the 11 assembled sequences across the testing samples (30 healthy individuals and 25 LC patients). The coverage is the read-alignment depth in each nucleotide normalized by the number of million reads. To avoid the effect of large span, we use the logarithm of (coverage+1) as the numerical value of the heatmaps. The horizontal axis is composed of each nucleotide of the 11 sequences, and the vertical axis is composed of healthy individuals and patients. The upper part of each heatmap is the healthy group, and the lower part is the patient group.</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g004"></graphic>
</fig>
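<p>The heatmap values described in the caption of Figure 4 can be illustrated with a short Python sketch. It assumes that the coverage is the per-nucleotide read depth divided by the total number of reads in the sample, expressed in millions, and that the logarithm is the natural logarithm; neither detail is stated explicitly, and the function name and toy numbers are hypothetical.</p>

```python
import math


def heatmap_values(depths, total_reads):
    """Return log(coverage + 1) per nucleotide, with coverage = depth per million reads."""
    millions_of_reads = total_reads / 1_000_000
    return [math.log(depth / millions_of_reads + 1) for depth in depths]


# Toy example: three nucleotide positions in a sample with two million reads.
values = heatmap_values([0, 10, 40], total_reads=2_000_000)
print(values)
```

<p>Adding 1 before taking the logarithm keeps positions with zero depth at a value of zero, so uncovered regions remain distinguishable in the heatmap.</p>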
</sec>
<sec><title>Taxonomic Information of the <italic>Group-Specific</italic>
 Markers</title>
<p>We aligned the 11 <italic>LC-specific</italic>
 sequences to genomes with “Nucleotide Blast” in NCBI, and all of the sequences were aligned to two strains of <italic>V. parvula</italic>
, UTDB1-3 and DSM2008, with 100% query coverage and 97–100% identity. In a previous analysis based on aligning reads to reference genomes (<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
), <italic>V. parvula</italic>
 demonstrated a significant difference in abundance between the two groups of LC patients and healthy individuals.</p>
<p>All 37,302 <italic>group-specific</italic>
 logical features and 345 <italic>group-specific</italic>
 numerical features were also aligned with BLAST against reference genomes in NCBI: 31,067 logical and 268 numerical <italic>40</italic>
-mers could be matched to <italic>V. parvula</italic>
 strain UTDB1-3, and 29,712 logical and 262 numerical <italic>40</italic>
-mers could be matched to <italic>V. parvula</italic>
 strain DSM2008. Using <italic>V. parvula</italic>
 strain UTDB1-3 as an example, <bold>Figure <xref ref-type="fig" rid="F5">5A</xref>
</bold>
 shows the coverage of the whole genome (2.17 Mbp) by the <italic>LC-specific 40</italic>
-mers. The horizontal axis is the whole genome. The <italic>40</italic>
-mers covered most parts of the genome. <bold>Figures <xref ref-type="fig" rid="F5">5B</xref>
–<xref ref-type="fig" rid="F5">D</xref>
</bold>
 are the zoomed-in alignments and coverages of the genome: 108,308–122,356, 2,037,894–2,038,165, and 2,038,052–2,038,119, marked as “zoom1,” “zoom2,” and “zoom3”, respectively, in the figure. It is clear that many regions are highly and consecutively covered by <italic>k</italic>
-mers. As shown in <bold>Figure <xref ref-type="fig" rid="F5">5E</xref>
</bold>
, region 1,423,893–1,423,993 of <italic>V. parvula</italic>
 strain DSM2008 corresponds to “Zoom3” region of <italic>V. parvula</italic>
 strain UTDB1-3. Comparing the regions in these two strains, the consensus mismatch against UTDB1-3 is absent in DSM2008, while another consensus mismatch appears against DSM2008 at position 1,423,924. The consistent mismatches against strains UTDB1-3 and DSM2008 of <italic>V. parvula</italic>
 indicate the possible existence of an unknown strain of <italic>V. parvula</italic>
, which would exist in the gut of LC patients but be absent in the gut of healthy controls.</p>
<fig id="F5" position="float"><label>FIGURE 5</label>
<caption><p>The alignments of the identified <italic>group-specific 40</italic>
-mers to the genome sequence of <italic>V. parvula</italic>
 strain UTDB1-3. <bold>(A)</bold>
 The alignment distribution over the whole genome. <bold>(B)</bold>
 The alignments and coverages of region 108,308–122,356 (Zoom1). The red and blue bars denote the <italic>40</italic>
-mers matched to reference genome sequence forward and backward, respectively. <bold>(C)</bold>
 The alignments and coverages of region 2,037,894–2,038,165 (Zoom2). <bold>(D)</bold>
 The alignments and coverages of region 2,038,053–2,038,119 (Zoom3) with consensus mismatches on 2,038,082. <bold>(E)</bold>
 The alignments and coverages of region 1,423,893–1,423,993 of <italic>V. parvula</italic>
 strain DSM2008. This region corresponds to the Zoom3 region of <italic>V. parvula</italic>
 strain UTDB1-3. Comparing the two regions in the two strains, the consensus mismatch (in green color in <bold>D</bold>
) against UTDB1-3 is absent in DSM2008, but another consensus mismatch (in green color in <bold>E</bold>
) appears against DSM2008 at position 1,423,924.</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g005"></graphic>
</fig>
</sec>
</sec>
<sec><title>The IBD-Associated and WT2D-Associated Metagenomic Datasets</title>
<p>Two additional disease-associated metagenomic datasets were analyzed with 20 independent runs of 10-fold cross-validation so that classification performance could be compared directly with previous studies. We emphasize that feature preprocessing and selection were done using only the training set, thereby avoiding biased and overly optimistic performance estimates (<xref rid="B49" ref-type="bibr">Zhang et al., 2006</xref>
; <xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
).</p>
<sec><title>The IBD-Associated Dataset</title>
<p>For each fold test of 10-fold cross-validation, about 7000 <italic>group-specific</italic>
 logical features with ASS ≥ 0.8, but no <italic>group-specific</italic>
 numerical features, were identified. The numbers of <italic>group-specific</italic>
 features varied with different fold tests. Because of the relatively small sample size, <italic>30</italic>
-mers were set as features. For each <italic>group-specific 30</italic>
-mer, its <italic>single-logical-feature</italic>
 predictor yielded an ASS score on validation. For each round of cross-validation, ∼7000×10 (∼7000 <italic>single-logical-feature</italic>
 predictors and 10-folds) ASS values were obtained on validations. The boxplots in <bold>Figure <xref ref-type="fig" rid="F6">6A</xref>
</bold>
 present the distribution of the ∼70,000 ASS values in each of the 20 rounds of 10-fold cross-validation. The values lie between 0.78 and 0.89 and are centered at 0.81–0.82, indicating that a single binary feature can achieve ASS ≥ 0.78 on validation by itself. The average ASS score is 0.875 ± 0.004 (95% confidence interval). The top 15 ranked features were combined to design a random forests classifier. <bold>Figure <xref ref-type="fig" rid="F6">6B</xref>
</bold>
 presents the ROC curves of 20 independent runs, which were averaged over the 10-folds of cross-validation. The mean AUC of 20 runs is 0.990 ± 0.005 (95% confidence interval), which is much higher than the results reported in previous studies. As shown in <bold>Table <xref ref-type="table" rid="T4">4</xref>
</bold>
, using the same dataset, <xref rid="B27" ref-type="bibr">Pasolli et al. (2016)</xref>
 designed two classifiers. The random forests classifier based on 443 species-abundance features achieved an averaged AUC = 0.893 ± 0.080 under the same experimental setting. The SVM classifier based on the presence of 91,756 strain-specific markers achieved AUC = 0.914 ± 0.084. <xref rid="B46" ref-type="bibr">Xing et al. (2017)</xref>
 obtained AUC = 0.967 with a logistic regression model with LASSO penalty in leave-one-out cross-validation (LOOCV), which used the relative abundances of bins as features. In another study, <xref rid="B6" ref-type="bibr">Cui and Zhang (2013)</xref>
 obtained accuracy = 88%, sensitivity = 92%, and specificity = 84% with 200 <italic>7</italic>
-mer features in LOOCV on 25 healthy subjects and 25 patients; these samples are a subset of those in our experiment, and LOOCV is less stringent than 10-fold cross-validation.</p>
<fig id="F6" position="float"><label>FIGURE 6</label>
<caption><p><bold>(A)</bold>
 The IBD-associated dataset: the boxplots of ASS by <italic>single-logical-feature</italic>
 predictors on each one of the identified ∼7000 <italic>group-specific</italic>
 features in the 20 independent runs of 10-fold cross-validation on the IBD dataset. Each boxplot is composed of ∼70,000 ASS values on each round of cross-validation. The ASS values are between 0.78 and 0.89 and centered on 0.81–0.82. The “+” symbol denotes outliers. <bold>(B)</bold>
 The ROC curves of the IBD-associated dataset: The top 15 ranked <italic>30</italic>
-mers were combined to design the random forests classifier. The 20 ROC curves are from the 20 independent runs, and each one is the average over the 10-folds of cross-validation. The mean AUC is 0.990 ± 0.005 (95% confidence interval). <bold>(C)</bold>
 The ROC curves of the WT2D-associated dataset: The top 10 ranked <italic>40</italic>
-mers were combined to design the random forests classifier. The 20 ROC curves are from the 20 independent runs, and each one is the average over the 10-folds of cross-validation. The mean AUC is 0.939 ± 0.011 (95% confidence interval).</p>
</caption>
<graphic xlink:href="fmicb-09-00872-g006"></graphic>
</fig>
<table-wrap id="T4" position="float"><label>Table 4</label>
<caption><p>Comparison of performance of different methods based on the IBD and WT2D datasets.</p>
</caption>
<table frame="hsides" rules="groups" cellspacing="5" cellpadding="5"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1"></th>
<td valign="top" align="center" colspan="7" rowspan="1">IBD dataset<hr></hr>
</td>
</tr>
<tr><th valign="top" align="left" rowspan="1" colspan="1">Experiment</th>
<th valign="top" align="left" colspan="2" rowspan="1"></th>
<th valign="top" align="left" colspan="3" rowspan="1">20 runs of 10-fold cross-validation (25P+97H)</th>
<th valign="top" align="left" rowspan="1" colspan="1"></th>
<th valign="top" align="left" rowspan="1" colspan="1">Five runs of LOOCV (25P+25H)</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Feature</bold>
</td>
<td valign="top" align="left" colspan="2" rowspan="1"><bold><italic>30</italic>
-mer</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold><italic>30</italic>
-mer</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Species abundance<sup>†</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Presence of strain-specific markers<sup>†</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Abundance in contig bin<sup>††††</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>7</italic>
-mer<sup>††</sup>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Number of features</bold>
</td>
<td valign="top" align="left" colspan="2" rowspan="1"><bold>1</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>15</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">443</td>
<td valign="top" align="left" rowspan="1" colspan="1">91756</td>
<td valign="top" align="left" rowspan="1" colspan="1">Not mentioned</td>
<td valign="top" align="left" rowspan="1" colspan="1">200</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Classifier</bold>
</td>
<td valign="top" align="left" colspan="2" rowspan="1"><bold>Single</bold>
<break></break>
<bold>logical</bold>
<break></break>
<bold>feature</bold>
<break></break>
<bold>predictor</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Random</bold>
<break></break>
<bold>forests</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Random<break></break>
forests</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
<td valign="top" align="left" rowspan="1" colspan="1">Logistic regression + LASSO</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>AUC</bold>
</td>
<td valign="top" align="left" colspan="2" rowspan="1"><bold>ASS* = 0.875 ± 0.004</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>0.990 ± 0.005</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.893 ± 0.080</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.914 ± 0.084</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.967</td>
<td valign="top" align="left" rowspan="1" colspan="1">Accuracy = 0.88</td>
</tr>
<tr><td valign="top" align="left" colspan="8" rowspan="1"><hr></hr>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="center" colspan="7" rowspan="1"><bold>WT2D dataset</bold>
<hr></hr>
</td>
</tr>
<tr><td valign="top" align="left" colspan="3" rowspan="1"><bold>Experiment</bold>
</td>
<td valign="top" align="center" colspan="2" rowspan="1"><bold>20 runs of 10-fold cross-validation (52P+43H)</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="center" colspan="2" rowspan="1"><bold>Training (20H+20P)</bold>
<break></break>
<bold>Testing (32P+13H)</bold>
</td>
</tr>
<tr><td valign="top" align="left" colspan="8" rowspan="1"><hr></hr>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Feature</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold><italic>40</italic>
-mer</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold><italic>40</italic>
-mer</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Species abundance<sup>†</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Presence of strain-specific markers<sup>†</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Gene markers<sup>†††</sup>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Abundance of bins with MetaGen</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold><italic>40-mer</italic>
</bold>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Number of features</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>1</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>10</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">381</td>
<td valign="top" align="left" rowspan="1" colspan="1">83456</td>
<td valign="top" align="left" rowspan="1" colspan="1">50</td>
<td valign="top" align="left" rowspan="1" colspan="1">3</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>3</bold>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>Classifier</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Single</bold>
<break></break>
<bold>logical</bold>
<break></break>
<bold>feature</bold>
<break></break>
<bold>predictor</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Random</bold>
<break></break>
<bold>forests</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">Random<break></break>
forests</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
<td valign="top" align="left" rowspan="1" colspan="1">Support<break></break>
vector<break></break>
machine</td>
<td valign="top" align="left" rowspan="1" colspan="1">Random<break></break>
forests</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>Random</bold>
<break></break>
<bold>forests</bold>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><bold>AUC</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>ASS = 0.76 ± 0.003</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>0.939 ± 0.011</bold>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.772 ± 0.116</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.785 ± 0.104</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.83</td>
<td valign="top" align="left" rowspan="1" colspan="1">0.961 (training)<break></break>
0.685 (testing)</td>
<td valign="top" align="left" rowspan="1" colspan="1"><bold>0.979 (training)</bold>
<break></break>
<bold>0.782 (testing)</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><attrib><italic>Using far fewer features, MetaGO achieved better results than the other methods. The results of MetaGO are in bold. There were two experimental settings for the IBD dataset; the samples in “Five runs of LOOCV” are a subset of those in our experiment, and LOOCV is less stringent than 10-fold cross-validation. For the WT2D dataset, <italic>40</italic>
-mers were tested under two experimental settings for comparison with other methods. <sup>†</sup>
(<xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
); <sup>††</sup>
(<xref rid="B6" ref-type="bibr">Cui and Zhang, 2013</xref>
); <sup>†††</sup>
(<xref rid="B31" ref-type="bibr">Qin et al., 2014</xref>
); <sup>††††</sup>
(<xref rid="B46" ref-type="bibr">Xing et al., 2017</xref>
); <sup>∗</sup>
average of sensitivity and specificity.</italic>
</attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec><title>The WT2D-Associated Dataset</title>
<p>For each fold test of 10-fold cross-validation, ∼700 <italic>40-</italic>
mers with ASS ≥ 0.75 were identified, and the best ASS score was 0.78. The classifier designed with random forests using the top 10 <italic>group-specific 40</italic>
-mer features obtained an average AUC = 0.939 ± 0.011 on the 20 independent runs of 10-fold cross-validation, as shown in <bold>Figure <xref ref-type="fig" rid="F6">6C</xref>
</bold>
. In previous studies under the same experimental setting, the average AUCs were 0.834 using 50 metagenomic clusters as features (<xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
) and 0.785 ± 0.104 using the presence of 83,456 strain-specific markers as features (<xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
). For further comparison, we implemented metagenome-wide <italic>de novo</italic>
 assembly with MegaHIT (<xref rid="B18" ref-type="bibr">Li et al., 2015</xref>
) and then binned the contigs with MetaGen (<xref rid="B46" ref-type="bibr">Xing et al., 2017</xref>
). The relative abundances of the bins were used as features to separate the patient and control groups. The full set of 96 samples was too large for read assembly, which required >256 GB of memory for 80 samples, and aligning the reads to the contigs was time-consuming. Therefore, 20 patients and 20 healthy individuals were randomly selected as the training set, and the remaining 56 samples were used for independent testing. A random forests classifier was trained on the training set using the relative abundances of the MetaGen bins as features. The definition of relative abundance in MetaGen includes parameters that the MetaGen algorithm must determine for each species (each bin is assumed to represent one species) and each sample. Consequently, when the classifier was tested on the independent set, these parameters also had to be determined for the independent samples. After personal communication with MetaGen’s developers, we revised the MetaGen code and calculated the relative abundances of the selected bins for each testing sample. With random forests, MetaGen achieved AUC = 0.685 using 3 bin features and AUC = 0.735 using 15 bin features on the testing data. With the same training samples, our pipeline obtained AUC = 0.782 with 3 features of <italic>k</italic>
-mers and AUC = 0.794 using 15 features of <italic>k</italic>
-mers with random forests on the testing data. Although both methods are reference free, the <italic>group-specific k</italic>
-mers showed greater discriminative power than the contig bins for predicting disease status. In addition, <italic>de novo</italic>
 assembly and contig binning are time-consuming: the run from read assembly through contig binning took 120 h on this training set.</p>
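<p>The training/testing design described above (a fixed number of randomly chosen patients and controls for training, all remaining samples for independent testing) can be sketched as follows. This is an illustrative Python sketch only: the function name, the seed, and the sample identifiers are hypothetical, and the 48/48 group sizes are placeholders, since this section reports only the total of 96 samples.</p>
<preformat>
```python
import random


def split_train_test(patient_ids, control_ids, n_per_group, seed):
    """Randomly pick n_per_group samples from each group for training.

    Every sample that is not picked goes to the independent test set.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train = rng.sample(patient_ids, n_per_group) + rng.sample(control_ids, n_per_group)
    train_set = set(train)
    test = [s for s in patient_ids + control_ids if s not in train_set]
    return train, test


# Placeholder identifiers; the real group sizes are not stated in this section.
patients = [f"patient_{i:02d}" for i in range(48)]
controls = [f"control_{i:02d}" for i in range(48)]

train_ids, test_ids = split_train_test(patients, controls, n_per_group=20, seed=0)
print(len(train_ids), len(test_ids))  # 40 training samples, 56 testing samples
```
</preformat>
<p>Keeping the test samples entirely out of feature selection and classifier training is what makes the reported testing AUC an independent estimate.</p>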
<p>In these experiments, IBD was more predictable than T2D. The experiments on the two disease-associated datasets demonstrate that <italic>group-specific k</italic>
-mers achieved much better classification performance with fewer features than previous studies that used the features of short <italic>k</italic>
-mer frequencies, species abundance, and strain marker presence. The experiments confirm the effectiveness of long <italic>k</italic>
-mer features and the strategy of identifying <italic>group-specific</italic>
 features.</p>
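<p>The comparisons above are reported as AUC values. As a reminder of what that number measures, the following Python sketch computes AUC directly from its pairwise definition: the fraction of (patient, control) pairs in which the patient receives the higher classifier score, with ties counted as one half. The function name and the scores are hypothetical and are not taken from the experiments.</p>
<preformat>
```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for s_case in case_scores:
        for s_ctrl in control_scores:
            if s_case > s_ctrl:
                wins += 1.0   # correctly ordered pair
            elif s_case == s_ctrl:
                wins += 0.5   # tie counts as half
    return wins / (len(case_scores) * len(control_scores))


# Made-up classifier scores for three patients and two controls.
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3])
print(round(example_auc, 3))  # 5 of the 6 pairs are ordered correctly
```
</preformat>
<p>An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation of the two groups.</p>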
</sec>
</sec>
<sec><title>Running the Computational Pipeline on <italic>Apache Spark</italic>
</title>
<p>For the LC dataset, it took 65 h to identify the <italic>group-specific 40</italic>
-mers from 56 healthy and 66 LC training samples (252 GB of fasta.gz files), including calculation of the <italic>40</italic>
-mer frequency vectors, integration of the feature matrix, and identification of the <italic>group-specific 40</italic>
-mers. Peak storage usage was about 1.5 TB. The pipeline was run in local mode on a server with 128 GB of memory and an Intel(R) Xeon(R) E5-2620 v4 CPU with 8 cores at 2.10 GHz.</p>
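<p>The first two steps listed above (per-sample <italic>k</italic>-mer frequency vectors and their integration into a feature matrix) can be illustrated on a toy scale. The following is a minimal single-machine Python sketch, not the Spark implementation used by MetaGO; the function names, the normalization to relative frequencies, and the skipping of windows that contain ambiguous bases are assumptions made for illustration, and reverse complements are not merged.</p>
<preformat>
```python
from collections import Counter


def count_kmers(reads, k):
    """Return a Counter of all length-k substrings (k-mers) found in the reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # A read of length n contains n - k + 1 overlapping k-mers.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer).issubset("ACGT"):
                counts[kmer] += 1
    return counts


def normalize(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def build_feature_matrix(sample_freqs):
    """Merge per-sample frequency dicts: one row per k-mer, one column per sample."""
    sample_ids = sorted(sample_freqs)
    all_kmers = sorted(set().union(*sample_freqs.values()))
    matrix = {
        kmer: [sample_freqs[s].get(kmer, 0.0) for s in sample_ids]
        for kmer in all_kmers
    }
    return sample_ids, matrix


# Toy input with k = 4 instead of 40; sample names and reads are made up.
samples = {"healthy_1": ["ACGTAC"], "patient_1": ["CGTACG", "ACGNAC"]}
freqs = {sid: normalize(count_kmers(reads, k=4)) for sid, reads in samples.items()}
sample_ids, matrix = build_feature_matrix(freqs)
for kmer, row in matrix.items():
    print(kmer, row)
```
</preformat>
<p>With 40-mers and more than a hundred deep-sequenced samples, an in-memory dictionary of this kind would likely become impractical, which is consistent with the storage figures reported above and with the choice of a parallel pipeline.</p>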
</sec>
</sec>
<sec><title>Discussion</title>
<p>Diseases differ in the complexity of their association with the human microbiome. If a disease is significantly associated with a specific microbial strain, species, or gene, then it is highly predictable using a <italic>single-feature</italic>
 predictor. That is, the disease can be diagnosed with a single microbial biomarker. However, many human diseases are complex in the sense that multiple <italic>group-specific</italic>
 markers are required to characterize the relationship between disease and microbiome. For these diseases, we have shown that combining several <italic>group-specific</italic>
 features can improve prediction accuracy.</p>
<p>In MetaGO, features were selected based on three preset thresholds: the ASS of the <italic>single-logical-feature</italic>
 predictor (𝜃<sub>1</sub>
), <italic>p</italic>
-value of Wilcoxon rank-sum test for numerical features (𝜃<sub>2</sub>
), and <italic>single-numerical</italic>
 logistic-regression predictor (𝜃<sub>3</sub>
). For the IBD-associated and LC-associated datasets, we set 𝜃<sub>1</sub>
 = 0.8, 𝜃<sub>2</sub>
 = 0.01, and 𝜃<sub>3</sub>
 = 0.8. However, for diseases with more complex associations with the microbiome, such as T2D (<xref rid="B27" ref-type="bibr">Pasolli et al., 2016</xref>
), 𝜃<sub>1</sub>
 was relaxed to 0.75, 𝜃<sub>2</sub>
 to 0.05, and 𝜃<sub>3</sub>
 to 0.75. Therefore, the three thresholds were, in effect, set according to the expected discriminant power of the features and the complexity of the association between disease and microbiome.</p>
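<p>To make the role of the first threshold concrete, the following Python sketch scores each logical feature by its ASS (the average of sensitivity and specificity) and keeps those that reach 𝜃<sub>1</sub>. It is a simplified illustration under stated assumptions: a logical feature is taken to be the presence or absence of a <italic>k</italic>-mer in a sample, presence is taken to predict the case group, and the function names and toy data are hypothetical. The two thresholds for numerical features (𝜃<sub>2</sub> and 𝜃<sub>3</sub>) are not implemented here.</p>
<preformat>
```python
def ass_score(present, is_case):
    """Average of sensitivity and specificity for the rule 'present implies case'."""
    n_case = sum(1 for c in is_case if c)
    n_ctrl = len(is_case) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("both groups must contain at least one sample")
    true_pos = sum(1 for p, c in zip(present, is_case) if p and c)
    true_neg = sum(1 for p, c in zip(present, is_case) if not p and not c)
    sensitivity = true_pos / n_case
    specificity = true_neg / n_ctrl
    return (sensitivity + specificity) / 2


def select_by_ass(logical_features, is_case, theta1):
    """Keep the features whose single-feature ASS reaches the threshold theta1."""
    selected = {}
    for name, present in logical_features.items():
        score = ass_score(present, is_case)
        if score >= theta1:
            selected[name] = score
    return selected


# Toy data: four cases followed by four controls; feature names are made up.
is_case = [True, True, True, True, False, False, False, False]
features = {
    "kmer_A": [True, True, True, False, False, False, False, False],  # ASS 0.875
    "kmer_B": [True, True, True, False, True, False, False, False],   # ASS 0.75
    "kmer_C": [True, False, True, False, True, False, True, False],   # ASS 0.5
}

strict = select_by_ass(features, is_case, theta1=0.8)
relaxed = select_by_ass(features, is_case, theta1=0.75)
print(sorted(strict), sorted(relaxed))
```
</preformat>
<p>In this toy example, relaxing the threshold from 0.8 to 0.75 admits one additional feature, which mirrors the reason given above for lowering the thresholds when the disease association is more complex.</p>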
<p>MetaGO was designed and implemented for two-group (case and control) datasets. Some studies, however, involve multiple disease subgroups or a pre-disease group. One example of disease subgroups is the AR-type (marked akinesia and rigidity) and T-type (predominant resting tremor) in Parkinson’s disease (<xref rid="B28" ref-type="bibr">Paulus and Jellinger, 1991</xref>
). Two examples of a pre-disease state are impaired glucose tolerance, which lies between normal glucose tolerance and T2D (<xref rid="B15" ref-type="bibr">Karlsson et al., 2013</xref>
), and colorectal adenoma, which lies between the healthy state and carcinoma (<xref rid="B7" ref-type="bibr">Feng et al., 2015</xref>
). For the multiple-group scenario, the way to use MetaGO depends on the research purpose. If the purpose is to identify microbial organisms that are associated with all subgroups of the disease, we can combine all individuals belonging to any disease group and treat them as one disease group. MetaGO can then be applied to the disease and control groups to identify the microbial organisms common to all disease groups. On the other hand, if the purpose is to identify microbial organisms that are specific to a particular group, we can combine all other individuals into one group and then use MetaGO to identify the microbial organisms specifically associated with that group. Extending MetaGO for a joint analysis of <italic>group-specific</italic>
 organisms in all the control and different disease groups is a topic of further study.</p>
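<p>Both strategies described above amount to relabeling the samples so that a two-group method can be applied. A minimal Python sketch of the two relabelings follows; the function names and the group labels are hypothetical and serve only as an illustration.</p>
<preformat>
```python
def pool_disease_groups(labels, control_label):
    """Strategy 1: every disease subgroup becomes one case group (True)."""
    return [label != control_label for label in labels]


def one_vs_rest(labels, target_label):
    """Strategy 2: the target subgroup (True) against all other individuals."""
    return [label == target_label for label in labels]


# Made-up labels for five samples.
labels = ["control", "AR-type", "T-type", "control", "AR-type"]

pooled = pool_disease_groups(labels, control_label="control")
ar_only = one_vs_rest(labels, target_label="AR-type")
print(pooled)   # cases are all AR-type and T-type samples
print(ar_only)  # cases are AR-type samples only
```
</preformat>
<p>Either binary labeling can then be passed to a two-group analysis; under the second strategy the control individuals and the other subgroups are treated as a single comparison group.</p>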
</sec>
<sec><title>Conclusion</title>
<p>In this study, we developed a computational framework, MetaGO, that is free from reference sequences, metagenome-wide <italic>de novo</italic>
 assembly, and sequence alignment, to identify <italic>group-specific</italic>
 sequences between two groups of microbial communities using long <italic>k</italic>
-mer features. The <italic>k</italic>
-mer length was set between 30 and 40 based on the tradeoff among sensitivity, specificity, and computational cost. The identified <italic>group-specific k</italic>
-mers present improved discriminant power for diagnosing diseases using human gut metagenomics data compared with previous studies.</p>
<p>To overcome the computational challenge of long <italic>k</italic>
-mer features, an open-source, parallel-computing pipeline was developed on <italic>Apache Spark</italic>
 to save computational resources and reduce running time. In this study, we applied MetaGO to analyze metagenomic disease-associated datasets. It should be noted that the pipeline is also suitable for identifying <italic>group-specific k</italic>
-mers for all types of high-throughput sequencing data where samples are collected from different groups, such as disease-associated human genome sequencing data or other phenotype-associated metagenomic datasets from different environments.</p>
<p>Our experiments validated improvements made by the identified <italic>group-specific k</italic>
-mer features compared to previous studies using other types of features. The <italic>group-specific</italic>
 sequences offer detailed insight into the differences between groups because the method identifies sequences that are present, or abundant, in one group but absent, or scarce, in the other; this is the fundamental working principle of <italic>group-specific</italic>
 sequences. We found that biological explorations based on <italic>group-specific</italic>
 sequences were consistent with those from previous biological experiments and additionally offered the potential for new discoveries. Therefore, using long <italic>k</italic>
-mer sequence signatures is an effective way to discover biological features, paving the way for a new paradigm of biomarker discovery in the context of host phenotypes. MetaGO enables the detection of <italic>group-specific</italic>
 features and development of prediction models using a single feature, or a combination of a few features, which helps to reduce the complexity of the model, while increasing the potential feasibility of follow-up discovery of discriminative microbial biomarker(s) for the easy diagnosis of human diseases.</p>
</sec>
<sec><title>Availability of Supporting Data and Source Codes</title>
<p>Source code and testing data are available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VVsmileyx/MetaGO">https://github.com/VVsmileyx/MetaGO</ext-link>
. The metagenomic sequencing datasets for IBD, LC, and T2D in European women were obtained from the European Bioinformatics Institute’s European Nucleotide Archive under accession numbers (EMBL: <ext-link ext-link-type="DDBJ/EMBL/GenBank" xlink:href="ERP000108">ERP000108</ext-link>
, <ext-link ext-link-type="DDBJ/EMBL/GenBank" xlink:href="ERP005860">ERP005860</ext-link>
, and <ext-link ext-link-type="DDBJ/EMBL/GenBank" xlink:href="ERP002469">ERP002469</ext-link>
).</p>
</sec>
<sec><title>Author Contributions</title>
<p>YW, FS, and TC planned the project. YW and ZY designed the model and experiments. LF performed the experiments. YW, JR, and FS analyzed the data. LF contributed materials/analysis tools. YW, JR, ZY, and FS wrote the main manuscript. All authors read and approved the final manuscript.</p>
</sec>
<sec><title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back><fn-group><fn fn-type="financial-disclosure"><p><bold>Funding.</bold>
 The research was supported by the National Natural Science Foundation of China (61673324, 61503314, and 61561146396); the U.S. National Institutes of Health (R01GM120624) and National Science Foundation (DMS-1518001); the Natural Science Foundation of Fujian (2016J01316); and a scholarship from the China Scholarship Council (201606315011).</p>
</fn>
</fn-group>
<ack><p>We thank Prof. Fan Yang and Jiping Tao at Xiamen University, China, for helpful discussions and suggestions.</p>
</ack>
<sec sec-type="supplementary material"><title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2018.00872/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmicb.2018.00872/full#supplementary-material</ext-link>
</p>
<supplementary-material content-type="local-data" id="SM1"><label>FILE S1</label>
<caption><p>Detailed descriptions of method and results.</p>
</caption>
<media xlink:href="Presentation_1.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="SM2"><label>FILE S2</label>
<caption><p><italic>LC-specific 40</italic>
-mers and sequences.</p>
</caption>
<media xlink:href="Presentation_2.zip"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
<ref-list><title>References</title>
<ref id="B1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Alneberg</surname>
<given-names>J.</given-names>
</name>
<name><surname>Bjarnason</surname>
<given-names>B. S.</given-names>
</name>
<name><surname>De Bruijn</surname>
<given-names>I.</given-names>
</name>
<name><surname>Schirmer</surname>
<given-names>M.</given-names>
</name>
<name><surname>Quick</surname>
<given-names>J.</given-names>
</name>
<name><surname>Ijaz</surname>
<given-names>U. Z.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2014</year>
). <article-title>Binning metagenomic contigs by coverage and composition.</article-title>
<source><italic>Nat. Methods</italic>
</source>
<volume>11</volume>
<fpage>1144</fpage>
–<lpage>1146</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth.3103</pub-id>
<pub-id pub-id-type="pmid">25218180</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>S. F.</given-names>
</name>
<name><surname>Madden</surname>
<given-names>T. L.</given-names>
</name>
<name><surname>Schäffer</surname>
<given-names>A. A.</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>1997</year>
). <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</article-title>
<source><italic>Nucleic Acids Res.</italic>
</source>
<volume>25</volume>
<fpage>3389</fpage>
–<lpage>3402</lpage>
. <pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Benoit</surname>
<given-names>G.</given-names>
</name>
<name><surname>Peterlongo</surname>
<given-names>P.</given-names>
</name>
<name><surname>Mariadassou</surname>
<given-names>M.</given-names>
</name>
<name><surname>Drezen</surname>
<given-names>E.</given-names>
</name>
<name><surname>Schbath</surname>
<given-names>S.</given-names>
</name>
<name><surname>Lavenier</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2016</year>
). <article-title>Multiple comparative metagenomics using multiset k-mer counting.</article-title>
<source><italic>PeerJ Comput. Sci.</italic>
</source>
<volume>2</volume>
:<issue>e94</issue>
<pub-id pub-id-type="doi">10.7717/peerj-cs.94</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname>
<given-names>L.</given-names>
</name>
</person-group>
 (<year>2001</year>
). <article-title>Random forests.</article-title>
<source><italic>Mach. Learn.</italic>
</source>
<volume>45</volume>
<fpage>5</fpage>
–<lpage>32</lpage>
. <pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Costello</surname>
<given-names>E. K.</given-names>
</name>
<name><surname>Lauber</surname>
<given-names>C. L.</given-names>
</name>
<name><surname>Hamady</surname>
<given-names>M.</given-names>
</name>
<name><surname>Fierer</surname>
<given-names>N.</given-names>
</name>
<name><surname>Gordon</surname>
<given-names>J. I.</given-names>
</name>
<name><surname>Knight</surname>
<given-names>R.</given-names>
</name>
</person-group>
 (<year>2009</year>
). <article-title>Bacterial community variation in human body habitats across space and time.</article-title>
<source><italic>Science</italic>
</source>
<volume>326</volume>
<fpage>1694</fpage>
–<lpage>1697</lpage>
. <pub-id pub-id-type="doi">10.1126/science.1177486</pub-id>
<pub-id pub-id-type="pmid">19892944</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cui</surname>
<given-names>H.</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>X.</given-names>
</name>
</person-group>
 (<year>2013</year>
). <article-title>Alignment-free supervised classification of metagenomes by recursive SVM.</article-title>
<source><italic>BMC Genomics</italic>
</source>
<volume>14</volume>
:<issue>641</issue>
. <pub-id pub-id-type="doi">10.1186/1471-2164-14-641</pub-id>
<pub-id pub-id-type="pmid">24053649</pub-id>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname>
<given-names>Q.</given-names>
</name>
<name><surname>Liang</surname>
<given-names>S.</given-names>
</name>
<name><surname>Jia</surname>
<given-names>H.</given-names>
</name>
<name><surname>Stadlmayr</surname>
<given-names>A.</given-names>
</name>
<name><surname>Tang</surname>
<given-names>L.</given-names>
</name>
<name><surname>Lan</surname>
<given-names>Z.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2015</year>
). <article-title>Gut microbiome development along the colorectal adenoma-carcinoma sequence.</article-title>
<source><italic>Nat. Commun.</italic>
</source>
<volume>6</volume>
:<issue>6528</issue>
. <pub-id pub-id-type="doi">10.1038/ncomms7528</pub-id>
<pub-id pub-id-type="pmid">25758642</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Fofanov</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Luo</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Katili</surname>
<given-names>C.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name><surname>Belosludtsev</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Powdrill</surname>
<given-names>T.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2004</year>
). <article-title>How independent are the appearances of n-mers in different genomes?</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>20</volume>
<fpage>2421</fpage>
–<lpage>2428</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/bth266</pub-id>
<pub-id pub-id-type="pmid">15087315</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Grabherr</surname>
<given-names>M. G.</given-names>
</name>
<name><surname>Haas</surname>
<given-names>B. J.</given-names>
</name>
<name><surname>Yassour</surname>
<given-names>M.</given-names>
</name>
<name><surname>Levin</surname>
<given-names>J. Z.</given-names>
</name>
<name><surname>Thompson</surname>
<given-names>D. A.</given-names>
</name>
<name><surname>Amit</surname>
<given-names>I.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2011</year>
). <article-title>Full-length transcriptome assembly from RNA-Seq data without a reference genome.</article-title>
<source><italic>Nat. Biotechnol.</italic>
</source>
<volume>29</volume>
<fpage>644</fpage>
–<lpage>652</lpage>
. <pub-id pub-id-type="doi">10.1038/nbt.1883</pub-id>
<pub-id pub-id-type="pmid">21572440</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Han</surname>
<given-names>W.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name><surname>Ye</surname>
<given-names>Y.</given-names>
</name>
</person-group>
 (<year>2017</year>
). <article-title>“A concurrent subtractive assembly approach for identification of disease associated sub-metagenomes,” in</article-title>
<source><italic>Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science</italic>
</source>
<volume>Vol. 10229</volume>
<role>ed.</role>
<person-group person-group-type="editor"><name><surname>Sahinalp</surname>
<given-names>S.</given-names>
</name>
</person-group>
 (<publisher-loc>Cham</publisher-loc>
: <publisher-name>Springer</publisher-name>
).</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname>
<given-names>X.</given-names>
</name>
<name><surname>Madan</surname>
<given-names>A.</given-names>
</name>
</person-group>
 (<year>1999</year>
). <article-title>CAP3: a DNA sequence assembly program.</article-title>
<source><italic>Genome Res.</italic>
</source>
<volume>9</volume>
<fpage>868</fpage>
–<lpage>877</lpage>
. <pub-id pub-id-type="doi">10.1101/gr.9.9.868</pub-id>
<pub-id pub-id-type="pmid">10508846</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="journal"><collab>Human Microbiome Project Consortium</collab>
 (<year>2012</year>
). <article-title>Structure, function and diversity of the healthy human microbiome.</article-title>
<source><italic>Nature</italic>
</source>
<volume>486</volume>
<fpage>207</fpage>
–<lpage>214</lpage>
. <pub-id pub-id-type="doi">10.1038/nature11234</pub-id>
<pub-id pub-id-type="pmid">22699609</pub-id>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname>
<given-names>B.</given-names>
</name>
<name><surname>Song</surname>
<given-names>K.</given-names>
</name>
<name><surname>Ren</surname>
<given-names>J.</given-names>
</name>
<name><surname>Deng</surname>
<given-names>M.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>F.</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>X.</given-names>
</name>
</person-group>
 (<year>2012</year>
). <article-title>Comparison of metagenomic samples using sequence signatures.</article-title>
<source><italic>BMC Genomics</italic>
</source>
<volume>13</volume>
:<issue>730</issue>
. <pub-id pub-id-type="doi">10.1186/1471-2164-13-730</pub-id>
<pub-id pub-id-type="pmid">23268604</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname>
<given-names>R.</given-names>
</name>
</person-group>
 (<year>2015</year>
). <article-title>Walking on multiple disease-gene networks to prioritize candidate genes.</article-title>
<source><italic>J. Mol. Cell Biol.</italic>
</source>
<volume>7</volume>
<fpage>214</fpage>
–<lpage>230</lpage>
. <pub-id pub-id-type="doi">10.1093/jmcb/mjv008</pub-id>
<pub-id pub-id-type="pmid">25681405</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Karlsson</surname>
<given-names>F. H.</given-names>
</name>
<name><surname>Tremaroli</surname>
<given-names>V.</given-names>
</name>
<name><surname>Nookaew</surname>
<given-names>I.</given-names>
</name>
<name><surname>Bergström</surname>
<given-names>G.</given-names>
</name>
<name><surname>Behre</surname>
<given-names>C. J.</given-names>
</name>
<name><surname>Fagerberg</surname>
<given-names>B.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2013</year>
). <article-title>Gut metagenome in European women with normal, impaired and diabetic glucose control.</article-title>
<source><italic>Nature</italic>
</source>
<volume>498</volume>
<fpage>99</fpage>
–<lpage>103</lpage>
. <pub-id pub-id-type="doi">10.1038/nature12198</pub-id>
<pub-id pub-id-type="pmid">23719380</pub-id>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kunin</surname>
<given-names>V.</given-names>
</name>
<name><surname>Copeland</surname>
<given-names>A.</given-names>
</name>
<name><surname>Lapidus</surname>
<given-names>A.</given-names>
</name>
<name><surname>Mavromatis</surname>
<given-names>K.</given-names>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P.</given-names>
</name>
</person-group>
 (<year>2008</year>
). <article-title>A bioinformatician’s guide to metagenomics.</article-title>
<source><italic>Microbiol. Mol. Biol. Rev.</italic>
</source>
<volume>72</volume>
<fpage>557</fpage>
–<lpage>578</lpage>
. <pub-id pub-id-type="doi">10.1128/MMBR.00009-08</pub-id>
<pub-id pub-id-type="pmid">19052320</pub-id>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Le</surname>
<given-names>V. V.</given-names>
</name>
<name><surname>Lang</surname>
<given-names>T. V.</given-names>
</name>
<name><surname>Le</surname>
<given-names>T. B.</given-names>
</name>
<name><surname>Hoai</surname>
<given-names>T. V.</given-names>
</name>
</person-group>
 (<year>2015</year>
). <article-title>A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads.</article-title>
<source><italic>Algorithms Mol. Biol.</italic>
</source>
<volume>10</volume>
:<issue>2</issue>
. <pub-id pub-id-type="doi">10.1186/s13015-014-0030-4</pub-id>
<pub-id pub-id-type="pmid">25648210</pub-id>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>D.</given-names>
</name>
<name><surname>Liu</surname>
<given-names>C.-M.</given-names>
</name>
<name><surname>Luo</surname>
<given-names>R.</given-names>
</name>
<name><surname>Sadakane</surname>
<given-names>K.</given-names>
</name>
<name><surname>Lam</surname>
<given-names>T.-W.</given-names>
</name>
</person-group>
 (<year>2015</year>
). <article-title>MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>31</volume>
<fpage>1674</fpage>
–<lpage>1676</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv033</pub-id>
<pub-id pub-id-type="pmid">25609793</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>R.</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name><surname>Ruan</surname>
<given-names>J.</given-names>
</name>
<name><surname>Qian</surname>
<given-names>W.</given-names>
</name>
<name><surname>Fang</surname>
<given-names>X.</given-names>
</name>
<name><surname>Shi</surname>
<given-names>Z.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2010</year>
). <article-title>De novo assembly of human genomes with massively parallel short read sequencing.</article-title>
<source><italic>Genome Res.</italic>
</source>
<volume>20</volume>
<fpage>265</fpage>
–<lpage>272</lpage>
. <pub-id pub-id-type="doi">10.1101/gr.097261.109</pub-id>
<pub-id pub-id-type="pmid">20019144</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname>
<given-names>W.</given-names>
</name>
<name><surname>Ren</surname>
<given-names>J.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>K.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name><surname>Zeng</surname>
<given-names>F.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2016</year>
). <article-title>Alignment-free transcriptomic and metatranscriptomic comparison using sequencing signatures with variable length markov chains.</article-title>
<source><italic>Sci. Rep.</italic>
</source>
<volume>6</volume>
:<issue>37243</issue>
. <pub-id pub-id-type="doi">10.1038/srep37243</pub-id>
<pub-id pub-id-type="pmid">27876823</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lozupone</surname>
<given-names>C. A.</given-names>
</name>
<name><surname>Stombaugh</surname>
<given-names>J.</given-names>
</name>
<name><surname>Gonzalez</surname>
<given-names>A.</given-names>
</name>
<name><surname>Ackermann</surname>
<given-names>G.</given-names>
</name>
<name><surname>Wendel</surname>
<given-names>D.</given-names>
</name>
<name><surname>Vázquez-Baeza</surname>
<given-names>Y.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2013</year>
). <article-title>Meta-analyses of studies of the human microbiota.</article-title>
<source><italic>Genome Res.</italic>
</source>
<volume>23</volume>
<fpage>1704</fpage>
–<lpage>1714</lpage>
. <pub-id pub-id-type="doi">10.1101/gr.151803.112</pub-id>
<pub-id pub-id-type="pmid">23861384</pub-id>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname>
<given-names>Y. Y.</given-names>
</name>
<name><surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name><surname>Fuhrman</surname>
<given-names>J. A.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>F.</given-names>
</name>
</person-group>
 (<year>2017</year>
). <article-title>COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge.</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>33</volume>
<fpage>791</fpage>
–<lpage>798</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw290</pub-id>
<pub-id pub-id-type="pmid">27256312</pub-id>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Marçais</surname>
<given-names>G.</given-names>
</name>
<name><surname>Kingsford</surname>
<given-names>C.</given-names>
</name>
</person-group>
 (<year>2011</year>
). <article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>27</volume>
<fpage>764</fpage>
–<lpage>770</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Nielsen</surname>
<given-names>H. B.</given-names>
</name>
<name><surname>Almeida</surname>
<given-names>M.</given-names>
</name>
<name><surname>Juncker</surname>
<given-names>A. S.</given-names>
</name>
<name><surname>Rasmussen</surname>
<given-names>S.</given-names>
</name>
<name><surname>Li</surname>
<given-names>J.</given-names>
</name>
<name><surname>Sunagawa</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2014</year>
). <article-title>Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes.</article-title>
<source><italic>Nat. Biotechnol.</italic>
</source>
<volume>32</volume>
<fpage>822</fpage>
–<lpage>828</lpage>
. <pub-id pub-id-type="doi">10.1038/nbt.2939</pub-id>
<pub-id pub-id-type="pmid">24997787</pub-id>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Papudeshi</surname>
<given-names>B.</given-names>
</name>
<name><surname>Haggerty</surname>
<given-names>J. M.</given-names>
</name>
<name><surname>Doane</surname>
<given-names>M.</given-names>
</name>
<name><surname>Morris</surname>
<given-names>M. M.</given-names>
</name>
<name><surname>Walsh</surname>
<given-names>K.</given-names>
</name>
<name><surname>Beattie</surname>
<given-names>D. T.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2017</year>
). <article-title>Optimizing and evaluating the reconstruction of Metagenome-assembled microbial genomes.</article-title>
<source><italic>BMC Genomics</italic>
</source>
<volume>18</volume>
:<issue>915</issue>
. <pub-id pub-id-type="doi">10.1186/s12864-017-4294-1</pub-id>
<pub-id pub-id-type="pmid">29183281</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pasolli</surname>
<given-names>E.</given-names>
</name>
<name><surname>Schiffer</surname>
<given-names>L.</given-names>
</name>
<name><surname>Manghi</surname>
<given-names>P.</given-names>
</name>
<name><surname>Renson</surname>
<given-names>A.</given-names>
</name>
<name><surname>Obenchain</surname>
<given-names>V.</given-names>
</name>
<name><surname>Truong</surname>
<given-names>D. T.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2017</year>
). <article-title>Accessible, curated metagenomic data through ExperimentHub.</article-title>
<source><italic>Nat. Methods</italic>
</source>
<volume>14</volume>
<fpage>1023</fpage>
–<lpage>1024</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth.4468</pub-id>
<pub-id pub-id-type="pmid">29088129</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pasolli</surname>
<given-names>E.</given-names>
</name>
<name><surname>Truong</surname>
<given-names>D. T.</given-names>
</name>
<name><surname>Malik</surname>
<given-names>F.</given-names>
</name>
<name><surname>Waldron</surname>
<given-names>L.</given-names>
</name>
<name><surname>Segata</surname>
<given-names>N.</given-names>
</name>
</person-group>
 (<year>2016</year>
). <article-title>Machine learning meta-analysis of large metagenomic datasets: tools and biological insights.</article-title>
<source><italic>PLoS Comput. Biol.</italic>
</source>
<volume>12</volume>
:<issue>e1004977</issue>
. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1004977</pub-id>
<pub-id pub-id-type="pmid">27400279</pub-id>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Paulus</surname>
<given-names>W.</given-names>
</name>
<name><surname>Jellinger</surname>
<given-names>K.</given-names>
</name>
</person-group>
 (<year>1991</year>
). <article-title>The neuropathologic basis of different clinical subgroups of Parkinson’s disease.</article-title>
<source><italic>J. Neuropathol. Exp. Neurol.</italic>
</source>
<volume>50</volume>
<fpage>743</fpage>
–<lpage>755</lpage>
. <pub-id pub-id-type="doi">10.1097/00005072-199111000-00006</pub-id>
<pub-id pub-id-type="pmid">1748881</pub-id>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname>
<given-names>J.</given-names>
</name>
<name><surname>Li</surname>
<given-names>R.</given-names>
</name>
<name><surname>Raes</surname>
<given-names>J.</given-names>
</name>
<name><surname>Arumugam</surname>
<given-names>M.</given-names>
</name>
<name><surname>Burgdorf</surname>
<given-names>K. S.</given-names>
</name>
<name><surname>Manichanh</surname>
<given-names>C.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2010</year>
). <article-title>A human gut microbial gene catalogue established by metagenomic sequencing.</article-title>
<source><italic>Nature</italic>
</source>
<volume>464</volume>
<fpage>59</fpage>
–<lpage>65</lpage>
. <pub-id pub-id-type="doi">10.1038/nature08821</pub-id>
<pub-id pub-id-type="pmid">20203603</pub-id>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname>
<given-names>J.</given-names>
</name>
<name><surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Cai</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Li</surname>
<given-names>S.</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>J.</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>F.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2012</year>
). <article-title>A metagenome-wide association study of gut microbiota in type 2 diabetes.</article-title>
<source><italic>Nature</italic>
</source>
<volume>490</volume>
<fpage>55</fpage>
–<lpage>60</lpage>
. <pub-id pub-id-type="doi">10.1038/nature11450</pub-id>
<pub-id pub-id-type="pmid">23023125</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname>
<given-names>N.</given-names>
</name>
<name><surname>Yang</surname>
<given-names>F.</given-names>
</name>
<name><surname>Li</surname>
<given-names>A.</given-names>
</name>
<name><surname>Prifti</surname>
<given-names>E.</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Shao</surname>
<given-names>L.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2014</year>
). <article-title>Alterations of the human gut microbiome in liver cirrhosis.</article-title>
<source><italic>Nature</italic>
</source>
<volume>513</volume>
<fpage>59</fpage>
–<lpage>64</lpage>
. <pub-id pub-id-type="doi">10.1038/nature13568</pub-id>
<pub-id pub-id-type="pmid">25079328</pub-id>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Quast</surname>
<given-names>C.</given-names>
</name>
<name><surname>Pruesse</surname>
<given-names>E.</given-names>
</name>
<name><surname>Yilmaz</surname>
<given-names>P.</given-names>
</name>
<name><surname>Gerken</surname>
<given-names>J.</given-names>
</name>
<name><surname>Schweer</surname>
<given-names>T.</given-names>
</name>
<name><surname>Yarza</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2012</year>
). <article-title>The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.</article-title>
<source><italic>Nucleic Acids Res.</italic>
</source>
<volume>41</volume>
<fpage>D590</fpage>
–<lpage>D596</lpage>
. <pub-id pub-id-type="doi">10.1093/nar/gks1219</pub-id>
<pub-id pub-id-type="pmid">23193283</pub-id>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname>
<given-names>J.</given-names>
</name>
<name><surname>Ahlgren</surname>
<given-names>N. A.</given-names>
</name>
<name><surname>Lu</surname>
<given-names>Y. Y.</given-names>
</name>
<name><surname>Fuhrman</surname>
<given-names>J. A.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>F.</given-names>
</name>
</person-group>
 (<year>2017</year>
). <article-title>VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data.</article-title>
<source><italic>Microbiome</italic>
</source>
<volume>5</volume>
:<issue>69</issue>
. <pub-id pub-id-type="doi">10.1186/s40168-017-0283-5</pub-id>
<pub-id pub-id-type="pmid">28683828</pub-id>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Richter</surname>
<given-names>D. C.</given-names>
</name>
<name><surname>Ott</surname>
<given-names>F.</given-names>
</name>
<name><surname>Auch</surname>
<given-names>A. F.</given-names>
</name>
<name><surname>Schmid</surname>
<given-names>R.</given-names>
</name>
<name><surname>Huson</surname>
<given-names>D. H.</given-names>
</name>
</person-group>
 (<year>2008</year>
). <article-title>MetaSim—a sequencing simulator for genomics and metagenomics.</article-title>
<source><italic>PLoS One</italic>
</source>
<volume>3</volume>
:<issue>e3373</issue>
. <pub-id pub-id-type="doi">10.1371/journal.pone.0003373</pub-id>
<pub-id pub-id-type="pmid">18841204</pub-id>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rizk</surname>
<given-names>G.</given-names>
</name>
<name><surname>Lavenier</surname>
<given-names>D.</given-names>
</name>
<name><surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
</person-group>
 (<year>2013</year>
). <article-title>DSK: k-mer counting with very low memory usage.</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>29</volume>
<fpage>652</fpage>
–<lpage>653</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt020</pub-id>
<pub-id pub-id-type="pmid">23325618</pub-id>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sangwan</surname>
<given-names>N.</given-names>
</name>
<name><surname>Xia</surname>
<given-names>F.</given-names>
</name>
<name><surname>Gilbert</surname>
<given-names>J. A.</given-names>
</name>
</person-group>
 (<year>2016</year>
). <article-title>Recovering complete and draft population genomes from metagenome datasets.</article-title>
<source><italic>Microbiome</italic>
</source>
<volume>4</volume>
:<issue>8</issue>
. <pub-id pub-id-type="doi">10.1186/s40168-016-0154-5</pub-id>
<pub-id pub-id-type="pmid">26951112</pub-id>
</mixed-citation>
</ref>
<ref id="B37"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sczyrba</surname>
<given-names>A.</given-names>
</name>
<name><surname>Hofmann</surname>
<given-names>P.</given-names>
</name>
<name><surname>Belmann</surname>
<given-names>P.</given-names>
</name>
<name><surname>Koslicki</surname>
<given-names>D.</given-names>
</name>
<name><surname>Janssen</surname>
<given-names>S.</given-names>
</name>
<name><surname>Dröge</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2017</year>
). <article-title>Critical assessment of metagenome interpretation—a benchmark of metagenomics software.</article-title>
<source><italic>Nat. Methods</italic>
</source>
<volume>14</volume>
<fpage>1063</fpage>
–<lpage>1071</lpage>
. <pub-id pub-id-type="doi">10.1038/nmeth.4458</pub-id>
<pub-id pub-id-type="pmid">28967888</pub-id>
</mixed-citation>
</ref>
<ref id="B38"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Segata</surname>
<given-names>N.</given-names>
</name>
<name><surname>Izard</surname>
<given-names>J.</given-names>
</name>
<name><surname>Waldron</surname>
<given-names>L.</given-names>
</name>
<name><surname>Gevers</surname>
<given-names>D.</given-names>
</name>
<name><surname>Miropolsky</surname>
<given-names>L.</given-names>
</name>
<name><surname>Garrett</surname>
<given-names>W. S.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2011</year>
). <article-title>Metagenomic biomarker discovery and explanation.</article-title>
<source><italic>Genome Biol.</italic>
</source>
<volume>12</volume>
:<issue>R60</issue>
. <pub-id pub-id-type="doi">10.1186/gb-2011-12-6-r60</pub-id>
<pub-id pub-id-type="pmid">21702898</pub-id>
</mixed-citation>
</ref>
<ref id="B39"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Lei</surname>
<given-names>X.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Song</surname>
<given-names>N.</given-names>
</name>
<name><surname>Zeng</surname>
<given-names>F.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2015</year>
). <article-title>Effect of k-tuple length on sample-comparison with high-throughput sequencing data.</article-title>
<source><italic>Biochem. Biophys. Res. Commun.</italic>
</source>
<volume>469</volume>
<fpage>1021</fpage>
–<lpage>1027</lpage>
. <pub-id pub-id-type="doi">10.1016/j.bbrc.2015.11.094</pub-id>
<pub-id pub-id-type="pmid">26721429</pub-id>
</mixed-citation>
</ref>
<ref id="B40"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Liu</surname>
<given-names>L.</given-names>
</name>
<name><surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name><surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>F.</given-names>
</name>
</person-group>
 (<year>2014</year>
). <article-title>Comparison of metatranscriptomic samples based on k-tuple frequencies.</article-title>
<source><italic>PLoS One</italic>
</source>
<volume>9</volume>
:<issue>e84348</issue>
. <pub-id pub-id-type="doi">10.1371/journal.pone.0084348</pub-id>
<pub-id pub-id-type="pmid">24392128</pub-id>
</mixed-citation>
</ref>
<ref id="B41"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name><surname>Wang</surname>
<given-names>K.</given-names>
</name>
<name><surname>Lu</surname>
<given-names>Y. Y.</given-names>
</name>
<name><surname>Sun</surname>
<given-names>F.</given-names>
</name>
</person-group>
 (<year>2017</year>
). <article-title>Improving contig binning of metagenomic data using d2S oligonucleotide frequency dissimilarity.</article-title>
<source><italic>BMC Bioinformatics</italic>
</source>
<volume>18</volume>
:<issue>425</issue>
. <pub-id pub-id-type="doi">10.1186/s12859-017-1835-1</pub-id>
<pub-id pub-id-type="pmid">28931373</pub-id>
</mixed-citation>
</ref>
<ref id="B42"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wen</surname>
<given-names>C.</given-names>
</name>
<name><surname>Zheng</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Shao</surname>
<given-names>T.</given-names>
</name>
<name><surname>Lin</surname>
<given-names>L.</given-names>
</name>
<name><surname>Xie</surname>
<given-names>Z.</given-names>
</name>
<name><surname>Le Chatelier</surname>
<given-names>E.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2017</year>
). <article-title>Quantitative metagenomics reveals unique gut microbiome biomarkers in ankylosing spondylitis.</article-title>
<source><italic>Genome Biol.</italic>
</source>
<volume>18</volume>
:<issue>142</issue>
. <pub-id pub-id-type="doi">10.1186/s13059-017-1271-6</pub-id>
<pub-id pub-id-type="pmid">28750650</pub-id>
</mixed-citation>
</ref>
<ref id="B43"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>White</surname>
<given-names>J. R.</given-names>
</name>
<name><surname>Nagarajan</surname>
<given-names>N.</given-names>
</name>
<name><surname>Pop</surname>
<given-names>M.</given-names>
</name>
</person-group>
 (<year>2009</year>
). <article-title>Statistical methods for detecting differentially abundant features in clinical metagenomic samples.</article-title>
<source><italic>PLoS Comput. Biol.</italic>
</source>
<volume>5</volume>
:<issue>e1000352</issue>
. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1000352</pub-id>
<pub-id pub-id-type="pmid">19360128</pub-id>
</mixed-citation>
</ref>
<ref id="B44"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wiest</surname>
<given-names>R.</given-names>
</name>
<name><surname>Lawson</surname>
<given-names>M.</given-names>
</name>
<name><surname>Geuking</surname>
<given-names>M.</given-names>
</name>
</person-group>
 (<year>2014</year>
). <article-title>Pathological bacterial translocation in liver cirrhosis.</article-title>
<source><italic>J. Hepatol.</italic>
</source>
<volume>60</volume>
<fpage>197</fpage>
–<lpage>209</lpage>
. <pub-id pub-id-type="doi">10.1016/j.jhep.2013.07.044</pub-id>
<pub-id pub-id-type="pmid">23993913</pub-id>
</mixed-citation>
</ref>
<ref id="B45"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname>
<given-names>Y.-W.</given-names>
</name>
<name><surname>Simmons</surname>
<given-names>B. A.</given-names>
</name>
<name><surname>Singer</surname>
<given-names>S. W.</given-names>
</name>
</person-group>
 (<year>2016</year>
). <article-title>MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets.</article-title>
<source><italic>Bioinformatics</italic>
</source>
<volume>32</volume>
<fpage>605</fpage>
–<lpage>607</lpage>
. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv638</pub-id>
<pub-id pub-id-type="pmid">26515820</pub-id>
</mixed-citation>
</ref>
<ref id="B46"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Xing</surname>
<given-names>X.</given-names>
</name>
<name><surname>Liu</surname>
<given-names>J. S.</given-names>
</name>
<name><surname>Zhong</surname>
<given-names>W.</given-names>
</name>
</person-group>
 (<year>2017</year>
). <article-title>MetaGen: reference-free learning with multiple metagenomic samples.</article-title>
<source><italic>Genome Biol.</italic>
</source>
<volume>18</volume>
:<issue>187</issue>
. <pub-id pub-id-type="doi">10.1186/s13059-017-1323-y</pub-id>
<pub-id pub-id-type="pmid">28974263</pub-id>
</mixed-citation>
</ref>
<ref id="B47"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yatsunenko</surname>
<given-names>T.</given-names>
</name>
<name><surname>Rey</surname>
<given-names>F. E.</given-names>
</name>
<name><surname>Manary</surname>
<given-names>M. J.</given-names>
</name>
<name><surname>Trehan</surname>
<given-names>I.</given-names>
</name>
<name><surname>Dominguez-Bello</surname>
<given-names>M. G.</given-names>
</name>
<name><surname>Contreras</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2012</year>
). <article-title>Human gut microbiome viewed across age and geography.</article-title>
<source><italic>Nature</italic>
</source>
<volume>486</volume>
<fpage>222</fpage>
–<lpage>227</lpage>
. <pub-id pub-id-type="doi">10.1038/nature11053</pub-id>
<pub-id pub-id-type="pmid">22699611</pub-id>
</mixed-citation>
</ref>
<ref id="B48"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zaharia</surname>
<given-names>M.</given-names>
</name>
<name><surname>Chowdhury</surname>
<given-names>M.</given-names>
</name>
<name><surname>Franklin</surname>
<given-names>M. J.</given-names>
</name>
<name><surname>Shenker</surname>
<given-names>S.</given-names>
</name>
<name><surname>Stoica</surname>
<given-names>I.</given-names>
</name>
</person-group>
 (<year>2010</year>
). <article-title>Spark: cluster computing with working sets.</article-title>
<source><italic>HotCloud</italic>
</source>
<volume>10</volume>
:<issue>95</issue>
.</mixed-citation>
</ref>
<ref id="B49"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name><surname>Lu</surname>
<given-names>X.</given-names>
</name>
<name><surname>Shi</surname>
<given-names>Q.</given-names>
</name>
<name><surname>Xu</surname>
<given-names>X. Q.</given-names>
</name>
<name><surname>Leung</surname>
<given-names>H. C.</given-names>
</name>
<name><surname>Harris</surname>
<given-names>L. N.</given-names>
</name>
<etal></etal>
</person-group>
 (<year>2006</year>
). <article-title>Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data.</article-title>
<source><italic>BMC Bioinformatics</italic>
</source>
<volume>7</volume>
:<issue>197</issue>
. <pub-id pub-id-type="doi">10.1186/1471-2105-7-197</pub-id>
<pub-id pub-id-type="pmid">16606446</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C80  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000C80  | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
