Cyberinfrastructure exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000507 (Pmc/Corpus); previous: 0005069; next: 0005080

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Ultrafast clustering algorithms for metagenomic sequence analysis</title>
<author>
<name sortKey="Li, Weizhong" sort="Li, Weizhong" uniqKey="Li W" first="Weizhong" last="Li">Weizhong Li</name>
</author>
<author>
<name sortKey="Fu, Limin" sort="Fu, Limin" uniqKey="Fu L" first="Limin" last="Fu">Limin Fu</name>
</author>
<author>
<name sortKey="Niu, Beifang" sort="Niu, Beifang" uniqKey="Niu B" first="Beifang" last="Niu">Beifang Niu</name>
</author>
<author>
<name sortKey="Wu, Sitao" sort="Wu, Sitao" uniqKey="Wu S" first="Sitao" last="Wu">Sitao Wu</name>
</author>
<author>
<name sortKey="Wooley, John" sort="Wooley, John" uniqKey="Wooley J" first="John" last="Wooley">John Wooley</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22772836</idno>
<idno type="pmc">3504929</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3504929</idno>
<idno type="RBID">PMC:3504929</idno>
<idno type="doi">10.1093/bib/bbs035</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000507</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Ultrafast clustering algorithms for metagenomic sequence analysis</title>
<author>
<name sortKey="Li, Weizhong" sort="Li, Weizhong" uniqKey="Li W" first="Weizhong" last="Li">Weizhong Li</name>
</author>
<author>
<name sortKey="Fu, Limin" sort="Fu, Limin" uniqKey="Fu L" first="Limin" last="Fu">Limin Fu</name>
</author>
<author>
<name sortKey="Niu, Beifang" sort="Niu, Beifang" uniqKey="Niu B" first="Beifang" last="Niu">Beifang Niu</name>
</author>
<author>
<name sortKey="Wu, Sitao" sort="Wu, Sitao" uniqKey="Wu S" first="Sitao" last="Wu">Sitao Wu</name>
</author>
<author>
<name sortKey="Wooley, John" sort="Wooley, John" uniqKey="Wooley J" first="John" last="Wooley">John Wooley</name>
</author>
</analytic>
<series>
<title level="j">Briefings in Bioinformatics</title>
<idno type="ISSN">1467-5463</idno>
<idno type="eISSN">1477-4054</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Rapid advances in high-throughput sequencing technologies have dramatically spurred metagenomic studies of microbial communities in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and complexity of these sequence data pose tremendous challenges for data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics: a large redundant data set can be represented with a small non-redundant set, where each cluster is represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using the consensus of sequences within clusters.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Handelsman, J" uniqKey="Handelsman J">J Handelsman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wooley, Jc" uniqKey="Wooley J">JC Wooley</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
<author>
<name sortKey="Friedberg, I" uniqKey="Friedberg I">I Friedberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Venter, Jc" uniqKey="Venter J">JC Venter</name>
</author>
<author>
<name sortKey="Remington, K" uniqKey="Remington K">K Remington</name>
</author>
<author>
<name sortKey="Heidelberg, Jf" uniqKey="Heidelberg J">JF Heidelberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gill, Sr" uniqKey="Gill S">SR Gill</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Deboy, Rt" uniqKey="Deboy R">RT Deboy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author>
<name sortKey="Kobayashi, A" uniqKey="Kobayashi A">A Kobayashi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mardis, Er" uniqKey="Mardis E">ER Mardis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dinsdale, Ea" uniqKey="Dinsdale E">EA Dinsdale</name>
</author>
<author>
<name sortKey="Edwards, Ra" uniqKey="Edwards R">RA Edwards</name>
</author>
<author>
<name sortKey="Hall, D" uniqKey="Hall D">D Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hess, M" uniqKey="Hess M">M Hess</name>
</author>
<author>
<name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author>
<name sortKey="Egan, R" uniqKey="Egan R">R Egan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qin, J" uniqKey="Qin J">J Qin</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peterson, J" uniqKey="Peterson J">J Peterson</name>
</author>
<author>
<name sortKey="Garges, S" uniqKey="Garges S">S Garges</name>
</author>
<author>
<name sortKey="Giovanni, M" uniqKey="Giovanni M">M Giovanni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chitsaz, H" uniqKey="Chitsaz H">H Chitsaz</name>
</author>
<author>
<name sortKey="Yee Greenbaum, Jl" uniqKey="Yee Greenbaum J">JL Yee-Greenbaum</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gomez Alvarez, V" uniqKey="Gomez Alvarez V">V Gomez-Alvarez</name>
</author>
<author>
<name sortKey="Teal, Tk" uniqKey="Teal T">TK Teal</name>
</author>
<author>
<name sortKey="Schmidt, Tm" uniqKey="Schmidt T">TM Schmidt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Niu, B" uniqKey="Niu B">B Niu</name>
</author>
<author>
<name sortKey="Fu, L" uniqKey="Fu L">L Fu</name>
</author>
<author>
<name sortKey="Sun, S" uniqKey="Sun S">S Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schloss, Pd" uniqKey="Schloss P">PD Schloss</name>
</author>
<author>
<name sortKey="Westcott, Sl" uniqKey="Westcott S">SL Westcott</name>
</author>
<author>
<name sortKey="Ryabin, T" uniqKey="Ryabin T">T Ryabin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Wooley, Jc" uniqKey="Wooley J">JC Wooley</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
<author>
<name sortKey="Rusch, Db" uniqKey="Rusch D">DB Rusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gilbert, Ja" uniqKey="Gilbert J">JA Gilbert</name>
</author>
<author>
<name sortKey="Field, D" uniqKey="Field D">D Field</name>
</author>
<author>
<name sortKey="Huang, Y" uniqKey="Huang Y">Y Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yona, G" uniqKey="Yona G">G Yona</name>
</author>
<author>
<name sortKey="Linial, N" uniqKey="Linial N">N Linial</name>
</author>
<author>
<name sortKey="Linial, M" uniqKey="Linial M">M Linial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sasson, O" uniqKey="Sasson O">O Sasson</name>
</author>
<author>
<name sortKey="Vaaknin, A" uniqKey="Vaaknin A">A Vaaknin</name>
</author>
<author>
<name sortKey="Fleischer, H" uniqKey="Fleischer H">H Fleischer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Park, J" uniqKey="Park J">J Park</name>
</author>
<author>
<name sortKey="Holm, L" uniqKey="Holm L">L Holm</name>
</author>
<author>
<name sortKey="Heger, A" uniqKey="Heger A">A Heger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pipenbacher, P" uniqKey="Pipenbacher P">P Pipenbacher</name>
</author>
<author>
<name sortKey="Schliep, A" uniqKey="Schliep A">A Schliep</name>
</author>
<author>
<name sortKey="Schneckener, S" uniqKey="Schneckener S">S Schneckener</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mika, S" uniqKey="Mika S">S Mika</name>
</author>
<author>
<name sortKey="Rost, B" uniqKey="Rost B">B Rost</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Stoeckert, Cj" uniqKey="Stoeckert C">CJ Stoeckert</name>
</author>
<author>
<name sortKey="Roos, Ds" uniqKey="Roos D">DS Roos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loewenstein, Y" uniqKey="Loewenstein Y">Y Loewenstein</name>
</author>
<author>
<name sortKey="Portugaly, E" uniqKey="Portugaly E">E Portugaly</name>
</author>
<author>
<name sortKey="Fromer, M" uniqKey="Fromer M">M Fromer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Schaffer, Aa" uniqKey="Schaffer A">AA Schaffer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, Y" uniqKey="Huang Y">Y Huang</name>
</author>
<author>
<name sortKey="Niu, B" uniqKey="Niu B">B Niu</name>
</author>
<author>
<name sortKey="Gao, Y" uniqKey="Gao Y">Y Gao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boguski, Ms" uniqKey="Boguski M">MS Boguski</name>
</author>
<author>
<name sortKey="Schuler, Gd" uniqKey="Schuler G">GD Schuler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pertea, G" uniqKey="Pertea G">G Pertea</name>
</author>
<author>
<name sortKey="Huang, X" uniqKey="Huang X">X Huang</name>
</author>
<author>
<name sortKey="Liang, F" uniqKey="Liang F">F Liang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burke, J" uniqKey="Burke J">J Burke</name>
</author>
<author>
<name sortKey="Davison, D" uniqKey="Davison D">D Davison</name>
</author>
<author>
<name sortKey="Hide, W" uniqKey="Hide W">W Hide</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Malde, K" uniqKey="Malde K">K Malde</name>
</author>
<author>
<name sortKey="Coward, E" uniqKey="Coward E">E Coward</name>
</author>
<author>
<name sortKey="Jonassen, I" uniqKey="Jonassen I">I Jonassen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ptitsyn, A" uniqKey="Ptitsyn A">A Ptitsyn</name>
</author>
<author>
<name sortKey="Hide, W" uniqKey="Hide W">W Hide</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hazelhurst, S" uniqKey="Hazelhurst S">S Hazelhurst</name>
</author>
<author>
<name sortKey="Liptak, Z" uniqKey="Liptak Z">Z Lipták</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Suzek, Be" uniqKey="Suzek B">BE Suzek</name>
</author>
<author>
<name sortKey="Huang, Hz" uniqKey="Huang H">HZ Huang</name>
</author>
<author>
<name sortKey="Mcgarvey, P" uniqKey="Mcgarvey P">P McGarvey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ghodsi, M" uniqKey="Ghodsi M">M Ghodsi</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, E" uniqKey="Bao E">E Bao</name>
</author>
<author>
<name sortKey="Jiang, T" uniqKey="Jiang T">T Jiang</name>
</author>
<author>
<name sortKey="Kaloshian, I" uniqKey="Kaloshian I">I Kaloshian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, S" uniqKey="Sun S">S Sun</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, S" uniqKey="Wu S">S Wu</name>
</author>
<author>
<name sortKey="Zhu, Z" uniqKey="Zhu Z">Z Zhu</name>
</author>
<author>
<name sortKey="Fu, L" uniqKey="Fu L">L Fu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Costello, Ek" uniqKey="Costello E">EK Costello</name>
</author>
<author>
<name sortKey="Lauber, Cl" uniqKey="Lauber C">CL Lauber</name>
</author>
<author>
<name sortKey="Hamady, M" uniqKey="Hamady M">M Hamady</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meyer, F" uniqKey="Meyer F">F Meyer</name>
</author>
<author>
<name sortKey="Paarmann, D" uniqKey="Paarmann D">D Paarmann</name>
</author>
<author>
<name sortKey="D Souza, M" uniqKey="D Souza M">M D'Souza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Markowitz, Vm" uniqKey="Markowitz V">VM Markowitz</name>
</author>
<author>
<name sortKey="Ivanova, Nn" uniqKey="Ivanova N">NN Ivanova</name>
</author>
<author>
<name sortKey="Szeto, E" uniqKey="Szeto E">E Szeto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rusch, Db" uniqKey="Rusch D">DB Rusch</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Turnbaugh, Pj" uniqKey="Turnbaugh P">PJ Turnbaugh</name>
</author>
<author>
<name sortKey="Hamady, M" uniqKey="Hamady M">M Hamady</name>
</author>
<author>
<name sortKey="Yatsunenko, T" uniqKey="Yatsunenko T">T Yatsunenko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schloss, Pd" uniqKey="Schloss P">PD Schloss</name>
</author>
<author>
<name sortKey="Handelsman, J" uniqKey="Handelsman J">J Handelsman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, Y" uniqKey="Sun Y">Y Sun</name>
</author>
<author>
<name sortKey="Cai, Y" uniqKey="Cai Y">Y Cai</name>
</author>
<author>
<name sortKey="Huse, Sm" uniqKey="Huse S">SM Huse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, Y" uniqKey="Sun Y">Y Sun</name>
</author>
<author>
<name sortKey="Cai, Y" uniqKey="Cai Y">Y Cai</name>
</author>
<author>
<name sortKey="Liu, L" uniqKey="Liu L">L Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huse, Sm" uniqKey="Huse S">SM Huse</name>
</author>
<author>
<name sortKey="Welch, Dm" uniqKey="Welch D">DM Welch</name>
</author>
<author>
<name sortKey="Morrison, Hg" uniqKey="Morrison H">HG Morrison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kunin, V" uniqKey="Kunin V">V Kunin</name>
</author>
<author>
<name sortKey="Engelbrektson, A" uniqKey="Engelbrektson A">A Engelbrektson</name>
</author>
<author>
<name sortKey="Ochman, H" uniqKey="Ochman H">H Ochman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quince, C" uniqKey="Quince C">C Quince</name>
</author>
<author>
<name sortKey="Lanzen, A" uniqKey="Lanzen A">A Lanzen</name>
</author>
<author>
<name sortKey="Curtis, Tp" uniqKey="Curtis T">TP Curtis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reeder, J" uniqKey="Reeder J">J Reeder</name>
</author>
<author>
<name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quince, C" uniqKey="Quince C">C Quince</name>
</author>
<author>
<name sortKey="Lanzen, A" uniqKey="Lanzen A">A Lanzen</name>
</author>
<author>
<name sortKey="Davenport, Rj" uniqKey="Davenport R">RJ Davenport</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Caporaso, Jg" uniqKey="Caporaso J">JG Caporaso</name>
</author>
<author>
<name sortKey="Kuczynski, J" uniqKey="Kuczynski J">J Kuczynski</name>
</author>
<author>
<name sortKey="Stombaugh, J" uniqKey="Stombaugh J">J Stombaugh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qu, W" uniqKey="Qu W">W Qu</name>
</author>
<author>
<name sortKey="Hashimoto, S" uniqKey="Hashimoto S">S Hashimoto</name>
</author>
<author>
<name sortKey="Morishita, S" uniqKey="Morishita S">S Morishita</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, X" uniqKey="Zhao X">X Zhao</name>
</author>
<author>
<name sortKey="Palmer, Le" uniqKey="Palmer L">LE Palmer</name>
</author>
<author>
<name sortKey="Bolanos, R" uniqKey="Bolanos R">R Bolanos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
<author>
<name sortKey="Scott, E" uniqKey="Scott E">E Scott</name>
</author>
<author>
<name sortKey="Kakaradov, B" uniqKey="Kakaradov B">B Kakaradov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kao, Wc" uniqKey="Kao W">WC Kao</name>
</author>
<author>
<name sortKey="Chan, Ah" uniqKey="Chan A">AH Chan</name>
</author>
<author>
<name sortKey="Song, Ys" uniqKey="Song Y">YS Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Yu, C" uniqKey="Yu C">C Yu</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Niu, B" uniqKey="Niu B">B Niu</name>
</author>
<author>
<name sortKey="Zhu, Z" uniqKey="Zhu Z">Z Zhu</name>
</author>
<author>
<name sortKey="Fu, L" uniqKey="Fu L">L Fu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, Y" uniqKey="Ye Y">Y Ye</name>
</author>
<author>
<name sortKey="Choi, Jh" uniqKey="Choi J">JH Choi</name>
</author>
<author>
<name sortKey="Tang, H" uniqKey="Tang H">H Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pruesse, E" uniqKey="Pruesse E">E Pruesse</name>
</author>
<author>
<name sortKey="Quast, C" uniqKey="Quast C">C Quast</name>
</author>
<author>
<name sortKey="Knittel, K" uniqKey="Knittel K">K Knittel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Cardenas, E" uniqKey="Cardenas E">E Cardenas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Larsen, N" uniqKey="Larsen N">N Larsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ogata, H" uniqKey="Ogata H">H Ogata</name>
</author>
<author>
<name sortKey="Goto, S" uniqKey="Goto S">S Goto</name>
</author>
<author>
<name sortKey="Sato, K" uniqKey="Sato K">K Sato</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rychlewski, L" uniqKey="Rychlewski L">L Rychlewski</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Li, Wz" uniqKey="Li W">WZ Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Noguchi, H" uniqKey="Noguchi H">H Noguchi</name>
</author>
<author>
<name sortKey="Park, J" uniqKey="Park J">J Park</name>
</author>
<author>
<name sortKey="Takagi, T" uniqKey="Takagi T">T Takagi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Finn, Rd" uniqKey="Finn R">RD Finn</name>
</author>
<author>
<name sortKey="Mistry, J" uniqKey="Mistry J">J Mistry</name>
</author>
<author>
<name sortKey="Tate, J" uniqKey="Tate J">J Tate</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eddy, Sr" uniqKey="Eddy S">SR Eddy</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Brief Bioinform</journal-id>
<journal-id journal-id-type="iso-abbrev">Brief. Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bib</journal-id>
<journal-id journal-id-type="hwp">bib</journal-id>
<journal-title-group>
<journal-title>Briefings in Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1467-5463</issn>
<issn pub-type="epub">1477-4054</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22772836</article-id>
<article-id pub-id-type="pmc">3504929</article-id>
<article-id pub-id-type="doi">10.1093/bib/bbs035</article-id>
<article-id pub-id-type="publisher-id">bbs035</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Papers</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Ultrafast clustering algorithms for metagenomic sequence analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Li</surname>
<given-names>Weizhong</given-names>
</name>
<xref ref-type="bio" rid="d34e36">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fu</surname>
<given-names>Limin</given-names>
</name>
<xref ref-type="bio" rid="d34e47">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Niu</surname>
<given-names>Beifang</given-names>
</name>
<xref ref-type="bio" rid="d34e58">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wu</surname>
<given-names>Sitao</given-names>
</name>
<xref ref-type="bio" rid="d34e69">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wooley</surname>
<given-names>John</given-names>
</name>
<xref ref-type="bio" rid="d34e80">*</xref>
</contrib>
</contrib-group>
<author-notes>
<corresp>Corresponding author. Weizhong Li. Center for Research in Biological Systems, University of California San Diego, La Jolla, CA 92093, USA. Tel:
<phone>858-534-4143</phone>
; Fax:
<fax>858-246-0644</fax>
; E-mail:
<email>liwz@sdsc.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>11</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>6</day>
<month>7</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>6</day>
<month>7</month>
<year>2012</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>13</volume>
<issue>6</issue>
<issue-title>Special Issue: Bioinformatics approaches and tools for metagenomic analysis</issue-title>
<fpage>656</fpage>
<lpage>668</lpage>
<history>
<date date-type="received">
<day>7</day>
<month>3</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>5</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2012. Published by Oxford University Press.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>Rapid advances in high-throughput sequencing technologies have dramatically spurred metagenomic studies of microbial communities in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and complexity of these sequence data pose tremendous challenges for data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics: a large redundant data set can be represented with a small non-redundant set, where each cluster is represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using the consensus of sequences within clusters.</p>
</abstract>
<kwd-group>
<kwd>clustering</kwd>
<kwd>metagenomics</kwd>
<kwd>next-generation sequencing</kwd>
<kwd>protein families</kwd>
<kwd>artificial duplicates</kwd>
<kwd>OTU</kwd>
</kwd-group>
<counts>
<page-count count="13"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>INTRODUCTION</title>
<p>Metagenomics [
<xref ref-type="bibr" rid="bbs035-B1">1</xref>
,
<xref ref-type="bibr" rid="bbs035-B2">2</xref>
] is a genomic approach that uses culture-independent sequencing to study microbial populations in different environments. It offers an unprecedented view of the identities, composition, dynamics, functions and interactions of the diverse microbial world and has become an important tool in many fields such as ecology, energy, agriculture and medicine.</p>
<p>Earlier metagenomics projects, such as Sargasso Sea [
<xref ref-type="bibr" rid="bbs035-B3">3</xref>
], human gut [
<xref ref-type="bibr" rid="bbs035-B4">4</xref>
] and soil [
<xref ref-type="bibr" rid="bbs035-B5">5</xref>
], relied on traditional Sanger sequencing technology, so most of these projects had limited throughput. In recent years, rapid advances in next-generation sequencing (NGS) technologies [
<xref ref-type="bibr" rid="bbs035-B6">6</xref>
], such as 454, Illumina, SOLiD, PacBio and Ion Torrent, have dramatically accelerated metagenomics research, and large ‘waves’ of metagenomic sequencing projects have been launched to study a diverse range of microbial communities in their environments, such as the virome [
<xref ref-type="bibr" rid="bbs035-B7">7</xref>
], farm animals [
<xref ref-type="bibr" rid="bbs035-B8">8</xref>
] and the human microbiome [
<xref ref-type="bibr" rid="bbs035-B9">9</xref>
,
<xref ref-type="bibr" rid="bbs035-B10">10</xref>
]. It is widely expected that many more environmental and microbiome samples will be studied by NGS technologies. However, the intrinsic complexity and massive quantity of metagenomics data create tremendous challenges for data analysis.</p>
<p>First, because of the sheer number of sequences, all kinds of sequence analyses, including database search, multiple alignment, sequence mapping, assembly and phylogenetic analysis, are becoming more time-consuming and memory-demanding and require more manual effort to parse the output. Second, the growth of sequence data in the public databases has been very uneven due to highly biased efforts toward model organisms and those populations or environments of special interest. Metagenomic analyses that rely on comparison with these biased and redundant reference databases may lead to incorrect conclusions. Third, different NGS techniques and protocols show quite different biases and artifacts. For example, single-cell multiple displacement amplification produces highly non-uniform coverage that varies by orders of magnitude [
<xref ref-type="bibr" rid="bbs035-B11">11</xref>
]. Many sequencers generate tens or hundreds of copies of artificially duplicated reads from the same templates [
<xref ref-type="bibr" rid="bbs035-B12">12</xref>
]. Fourth, NGS platforms have higher error rates than traditional Sanger sequencers and also have platform-specific error patterns, such as homopolymer indels for 454 and Ion Torrent reads and degraded quality at 3′-ends for Illumina reads. Finally, sequence errors and artifacts are propagated from reads to predicted protein sequences, which can be false genes, fragments or sequences with frame-shift errors.</p>
<p>Clustering analysis, a method that identifies and groups similar objects, is a powerful tool to explore and study large-scale complex data. It can effectively resolve many of the challenges stated earlier. By sequence clustering, a large redundant data set can be represented with a small non-redundant (NR) set, which requires less computation. Errors can be identified, filtered or corrected by using consensus from sequences within clusters. In addition, many fundamental questions in metagenomics can be readily addressed by clustering, such as the identification of gene families and the classification of species in a population. So, since the infancy of metagenomics, clustering analysis has been an essential part of this field for applications, such as identification of artificial duplicates [
<xref ref-type="bibr" rid="bbs035-B12">12</xref>
,
<xref ref-type="bibr" rid="bbs035-B13">13</xref>
], classification of operational taxonomic units (OTUs) [
<xref ref-type="bibr" rid="bbs035-B14">14</xref>
], protein family analysis [
<xref ref-type="bibr" rid="bbs035-B15">15</xref>
,
<xref ref-type="bibr" rid="bbs035-B16">16</xref>
] and transcriptomics analysis [
<xref ref-type="bibr" rid="bbs035-B17">17</xref>
].</p>
<p>In this article, we will discuss several common clustering applications in metagenomics and the methodologies for different types of analysis.</p>
</sec>
<sec>
<title>CLUSTERING METHODS AND RESOURCES</title>
<p>Sequence clustering is not a new topic; it existed long before the emergence of metagenomics and NGS technologies. In the past, many available clustering programs were used for clustering protein sequences such as ProtoMap [
<xref ref-type="bibr" rid="bbs035-B18">18</xref>
], ProtoNet [
<xref ref-type="bibr" rid="bbs035-B19">19</xref>
], RSDB [
<xref ref-type="bibr" rid="bbs035-B20">20</xref>
], GeneRAGE [
<xref ref-type="bibr" rid="bbs035-B21">21</xref>
], TribeMCL [
<xref ref-type="bibr" rid="bbs035-B22">22</xref>
], ProClust [
<xref ref-type="bibr" rid="bbs035-B23">23</xref>
], UniqueProt [
<xref ref-type="bibr" rid="bbs035-B24">24</xref>
], OrthoMCL [
<xref ref-type="bibr" rid="bbs035-B25">25</xref>
], MC-UPGMA [
<xref ref-type="bibr" rid="bbs035-B26">26</xref>
], Blastclust [
<xref ref-type="bibr" rid="bbs035-B27">27</xref>
] and CD-HIT [
<xref ref-type="bibr" rid="bbs035-B28 bbs035-B29 bbs035-B30 bbs035-B31">28–31</xref>
]. Many methods were also used for clustering expressed sequence tags (ESTs), such as Unigene [
<xref ref-type="bibr" rid="bbs035-B32">32</xref>
], TIGR Gene Indices [
<xref ref-type="bibr" rid="bbs035-B33">33</xref>
], d2_cluster [
<xref ref-type="bibr" rid="bbs035-B34">34</xref>
] and several others [
<xref ref-type="bibr" rid="bbs035-B35 bbs035-B36 bbs035-B37">35–37</xref>
].</p>
<p>Many of the above clustering methods require all-against-all comparisons of sequences for optimal results, so they are computationally very intensive for large data sets. CD-HIT introduced a way to reduce this cost. Thus, with the rapid growth of sequence data, the fast program CD-HIT became a very popular clustering tool; it has been widely used in many areas such as preparing NR reference databases [
<xref ref-type="bibr" rid="bbs035-B38">38</xref>
]. CD-HIT uses a greedy incremental algorithm. Basically, sequences are first ordered by decreasing length, and the longest one becomes the seed of the first cluster. Then, each remaining sequence is compared with existing seeds. If the similarity with any seed meets a pre-defined cutoff, it is grouped into that cluster; otherwise, it becomes the seed of a new cluster. More recently, several new fast programs, including Uclust [
<xref ref-type="bibr" rid="bbs035-B39">39</xref>
], DNACLUST [
<xref ref-type="bibr" rid="bbs035-B40">40</xref>
] and SEED [
<xref ref-type="bibr" rid="bbs035-B41">41</xref>
], have been developed using greedy incremental approaches similar to that introduced by CD-HIT. These methods use various heuristics and achieve high speed in clustering NGS sequences. Herein, we briefly introduce the features and functions of these programs.</p>
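<p>The greedy incremental procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not the CD-HIT implementation: the function names are hypothetical, and the ungapped position-by-position identity used here is an assumed stand-in for the alignment-based identity that real programs compute.</p>

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions over the shorter sequence.

    Real tools use alignment-based identity; this ungapped comparison is
    only a stand-in for illustration.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_incremental_cluster(seqs, cutoff=0.9):
    """Return a list of clusters; the first item of each cluster is its seed."""
    clusters = []
    # Longest sequences first, so every seed is at least as long as its members.
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= cutoff:
                cluster.append(seq)   # join the first seed that meets the cutoff
                break
        else:
            clusters.append([seq])    # no seed matched: seq seeds a new cluster
    return clusters
```

<p>Because each sequence is compared only with the seeds found so far, the number of comparisons depends on the number of clusters rather than on all pairs of sequences.</p>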
<p>CD-HIT [
<xref ref-type="bibr" rid="bbs035-B28 bbs035-B29 bbs035-B30 bbs035-B31">28–31</xref>
] is a comprehensive clustering package. The current version (v 4.5) has seven programs. CD-HIT and CD-HIT-EST cluster protein and deoxyribonucleic acid (DNA) data sets, respectively. CD-HIT-454 identifies duplicates from 454 reads. PSI-CD-HIT clusters proteins at low-identity cutoff (20–50%). CD-HIT-DUP identifies duplicates from single or paired Illumina reads. CD-HIT-LAP identifies overlapping reads. CD-HIT-OTU is a multi-step pipeline to generate OTU clusters for ribosomal ribonucleic acid (rRNA) tags from 454 and Illumina platforms. CD-HIT uses a heuristic based on statistical k-mer filtering to speed up clustering calculations. It also has a multi-threading function, so it can run in parallel on multi-core computers. CD-HIT is open source software available from
<ext-link ext-link-type="uri" xlink:href="http://cd-hit.org">http://cd-hit.org</ext-link>
. It is also available from the cd-hit web server [
<xref ref-type="bibr" rid="bbs035-B28">28</xref>
], the CAMERA web portal [
<xref ref-type="bibr" rid="bbs035-B42">42</xref>
] and the WebMGA server [
<xref ref-type="bibr" rid="bbs035-B43">43</xref>
] for metagenomic data analysis.</p>
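<p>To illustrate how k-mer filtering can avoid unnecessary alignments, the following sketch counts the k-mers shared by a sequence and a seed and skips the pair when the count is too low. The bound is an assumption made for this example (a substitution-only model in which each mismatch destroys at most k k-mers), and all function names are hypothetical; the statistical filter actually used in CD-HIT may differ.</p>

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k):
    """Number of k-mers common to a and b, counted with multiplicity."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    return sum(min(n, counts_b[word]) for word, n in counts_a.items())


def min_shared_kmers(length, k, identity):
    """Lower bound on shared k-mers (assumed, substitution-only model).

    A read of `length` bases has length - k + 1 k-mers, and each of its
    (1 - identity) * length mismatches can destroy at most k of them.
    """
    return length - k + 1 - k * (1.0 - identity) * length


def passes_filter(seed, seq, k, identity):
    """True if seq shares enough k-mers with seed to be worth aligning."""
    required = min_shared_kmers(len(seq), k, identity)
    # Small tolerance guards against floating-point rounding in the bound.
    return shared_kmers(seed, seq, k) >= required - 1e-9
```

<p>Only pairs that pass such a filter would need a full alignment, which is where most of the saving comes from.</p>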
<p>Uclust [
<xref ref-type="bibr" rid="bbs035-B39">39</xref>
] follows CD-HIT’s greedy incremental approach, but it uses a heuristic called Usearch for fast sequence comparison. It also gains speed by comparing each sequence with only a few top candidates instead of the full database. Uclust can run on DNA, protein and rRNA sequences. Currently, its 32-bit pre-compiled binaries are freely available from
<ext-link ext-link-type="uri" xlink:href="http://www.drive5.com/usearch/">http://www.drive5.com/usearch/</ext-link>
. DNACLUST [
<xref ref-type="bibr" rid="bbs035-B40">40</xref>
] also follows a greedy incremental approach; it uses a suffix array to index the input data set. Unlike CD-HIT and Uclust, which can process both proteins and DNAs, DNACLUST only works on DNA sequences, and it is suitable for clustering highly similar DNAs, especially rRNA tags. It is available as an open source program at
<ext-link ext-link-type="uri" xlink:href="http://dnaclust.sourceforge.net/">http://dnaclust.sourceforge.net/</ext-link>
. SEED [
<xref ref-type="bibr" rid="bbs035-B41">41</xref>
] only works with Illumina reads and only identifies up to three mismatches and three overhanging bases. It uses an open hashing technique and a special class of spaced seeds, called block spaced seeds. SEED is also open source software available at
<ext-link ext-link-type="uri" xlink:href="http://manuals.bioinformatics.ucr.edu/home/seed">http://manuals.bioinformatics.ucr.edu/home/seed</ext-link>
.</p>
<p>Although some programs are claimed to be faster than others, those claims are usually based on a certain type of sequence and particular clustering parameters (e.g. an identity cutoff). Herein, we do not intend to make a side-by-side performance comparison but simply list some examples that we ran to give some hints on the speed and results of some common clustering analyses by these programs (
<xref ref-type="table" rid="bbs035-T1">Table 1</xref>
). CD-HIT and Uclust often produce comparable results in both protein and DNA clustering tests. SEED is faster than other programs in clustering Illumina reads, but it yields many more clusters. Except for SEED, the other three programs all work on rRNA sequences, where Uclust is fastest and CD-HIT gives the fewest clusters.
<table-wrap id="bbs035-T1" position="float">
<label>Table 1:</label>
<caption>
<p>Clustering speed and results for common data sets</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Data set
<xref ref-type="table-fn" rid="bbs035-TF1">
<sup>a</sup>
</xref>
</th>
<th rowspan="1" colspan="1">Program and parameters
<xref ref-type="table-fn" rid="bbs035-TF1">
<sup>b</sup>
</xref>
</th>
<th rowspan="1" colspan="1">Time
<xref ref-type="table-fn" rid="bbs035-TF1">
<sup>c</sup>
</xref>
(minutes)</th>
<th rowspan="1" colspan="1">Clusters</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="2" colspan="1">NCBI NR, proteins, 4.3 GB: 12 054 819 sequences
<xref ref-type="table-fn" rid="bbs035-TF1">
<sup>d</sup>
</xref>
</td>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.9’</td>
<td rowspan="1" colspan="1">1405/181</td>
<td rowspan="1" colspan="1">7 036 029</td>
</tr>
<tr>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.7’</td>
<td rowspan="1" colspan="1">962/152</td>
<td rowspan="1" colspan="1">4 933 074</td>
</tr>
<tr>
<td rowspan="4" colspan="1">Swissprot, proteins, 222 MB: 437 168 sequences</td>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.9’</td>
<td rowspan="1" colspan="1">3.7/0.8</td>
<td rowspan="1" colspan="1">298 617</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uclust v5 ‘-id 0.9’</td>
<td rowspan="1" colspan="1">17.3</td>
<td rowspan="1" colspan="1">301 076</td>
</tr>
<tr>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.7’</td>
<td rowspan="1" colspan="1">4.6/0.8</td>
<td rowspan="1" colspan="1">190 695</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uclust v5 ‘-id 0.7’</td>
<td rowspan="1" colspan="1">7.6</td>
<td rowspan="1" colspan="1">192 847</td>
</tr>
<tr>
<td rowspan="6" colspan="1">Illumina (SRR061270), 380 MB, 5 million reads</td>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.95’</td>
<td rowspan="1" colspan="1">56.8/9.2</td>
<td rowspan="1" colspan="1">956 734</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uclust v5 ‘-id 0.95’</td>
<td rowspan="1" colspan="1">164.6</td>
<td rowspan="1" colspan="1">958 887</td>
</tr>
<tr>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.9’</td>
<td rowspan="1" colspan="1">347.5/46.0</td>
<td rowspan="1" colspan="1">751 581</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uclust v5 ‘-id 0.9’</td>
<td rowspan="1" colspan="1">227.5</td>
<td rowspan="1" colspan="1">734 981</td>
</tr>
<tr>
<td rowspan="1" colspan="1">cd-hit v5.0 beta ‘-c 0.9’</td>
<td rowspan="1" colspan="1">23.5/4.0</td>
<td rowspan="1" colspan="1">750 276</td>
</tr>
<tr>
<td rowspan="1" colspan="1">SEED (default parameters)</td>
<td rowspan="1" colspan="1">7.9</td>
<td rowspan="1" colspan="1">1 056 109</td>
</tr>
<tr>
<td rowspan="3" colspan="1">1.1 million 16S rRNAs: 454 reads Ref. [
<xref ref-type="bibr" rid="bbs035-B44">44</xref>
]</td>
<td rowspan="1" colspan="1">cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.97’</td>
<td rowspan="1" colspan="1">47.9/7.5</td>
<td rowspan="1" colspan="1">24 842</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uclust v5 ‘-id 0.97’</td>
<td rowspan="1" colspan="1">4.3</td>
<td rowspan="1" colspan="1">29 586</td>
</tr>
<tr>
<td rowspan="1" colspan="1">DNACLUST ‘-s 0.97’</td>
<td rowspan="1" colspan="1">15.3</td>
<td rowspan="1" colspan="1">31 151</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bbs035-TF1">
<p>
<sup>a</sup>
NR and Swissprot were downloaded from NCBI at
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/blast/db/FASTA/">ftp://ftp.ncbi.nih.gov/blast/db/FASTA/</ext-link>
. Illumina reads from SRR061270 were downloaded from NCBI at
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/sra">http://www.ncbi.nlm.nih.gov/sra</ext-link>
. The 16S rRNAs were kindly provided by the authors of Ref. [
<xref ref-type="bibr" rid="bbs035-B44">44</xref>
].
<sup>b</sup>
‘-c 0.9’, ‘-id 0.9’ and ‘-s 0.9’ mean 90% identity. However, DNACLUST’s definition is slightly different from that of CD-HIT and Uclust (Ref. [
<xref ref-type="bibr" rid="bbs035-B40">40</xref>
]).
<sup>c</sup>
The second number is the time for eight cores; currently, only CD-HIT has a multi-threading function.
<sup>d</sup>
The free 32-bit version of Uclust cannot process NR, so only CD-HIT is used.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec>
<title>IDENTIFICATION OF ARTIFICIAL DUPLICATES</title>
<p>NGS platforms, such as 454 and Illumina, commonly produce artificially duplicated reads, which can lead to an overestimated abundance of species, genes or functions. The duplicates originate from the same template but are separately sequenced, so they can be exactly identical or can be nearly identical with variable read lengths (454 reads) and mismatches due to sequence errors.</p>
<p>In 454 data sets, duplicated reads can make up 11–35% of the raw reads [
<xref ref-type="bibr" rid="bbs035-B12">12</xref>
]. As finding identical sequences is very easy, only exact duplicates were identified and removed in some early studies [
<xref ref-type="bibr" rid="bbs035-B7">7</xref>
]. Nearly identical duplicates were considered ever since the study by Gomez-Alvarez
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="bbs035-B12">12</xref>
]. Gomez-Alvarez’s method applies CD-HIT-EST to cluster the reads at 90% identity and then parses the clustering results. Later, CD-HIT-454 [
<xref ref-type="bibr" rid="bbs035-B13">13</xref>
] was introduced by reengineering CD-HIT-EST. CD-HIT-454 is faster and more accurate than Gomez-Alvarez’s method. It identifies duplicates that are either exactly identical or meet the following criteria: (i) reads must be aligned at their 5′-ends; (ii) for sequences of different length, the shorter read must be fully aligned with the longer one (the seed) and (iii) they have less than a user-defined percentage of indels and substitutions (default 4%). The default cut-off value, which is trained according to the pyrosequencing error model, maximizes the sensitivity and specificity of identification of duplicates from 454 reads. Another common, easy way of finding duplicates is to compare prefixes and consider reads to be duplicates if they share a common prefix of a certain length. Both MG-RAST [
<xref ref-type="bibr" rid="bbs035-B45">45</xref>
] and IMG/M [
<xref ref-type="bibr" rid="bbs035-B46">46</xref>
] use prefix checking for identification of duplicates. Prefix checking is faster but less accurate than CD-HIT-454. CD-HIT-454 only needs a few minutes to run a typical 454 data set with fewer than a million reads, so it is still very efficient, similar to the original CD-HIT.</p>
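<p>The three criteria for nearly identical 454 duplicates can be expressed as a small check. The sketch below is our own illustration under stated assumptions, not the CD-HIT-454 code: the function names are hypothetical, errors are counted with a plain edit distance, and the error rate is taken relative to the length of the shorter read.</p>

```python
def prefix_edit_distance(short, long):
    """Minimum edit distance between `short` and any prefix of `long`.

    Both reads are anchored at their 5'-ends, and `short` must be aligned
    in full, while the 3'-tail of `long` may be left unaligned.
    """
    # prev[j] = edits needed to turn the first i bases of short into long[:j]
    prev = list(range(len(long) + 1))
    for i, a in enumerate(short, start=1):
        curr = [i]
        for j, b in enumerate(long, start=1):
            curr.append(min(prev[j] + 1,              # base of short left unmatched
                            curr[j - 1] + 1,          # base of long left unmatched
                            prev[j - 1] + (a != b)))  # match or substitution
        prev = curr
    return min(prev)   # best-scoring prefix of long for the whole of short


def is_duplicate(read, seed, max_error=0.04):
    """True if the two reads satisfy the three criteria (4% default cutoff)."""
    if len(read) > len(seed):
        read, seed = seed, read   # the longer read acts as the seed
    if not read:
        return False
    errors = prefix_edit_distance(read, seed)
    return max_error > errors / len(read)
```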
<p>For Illumina data sets, prefix checking has more advantages, because it is faster than CD-HIT-454, and it fits the features of Illumina reads, which have fewer indels and exhibit worse quality at the 3′-ends. For paired-end Illumina reads, a reasonable way of finding duplicates is to check prefixes at both ends. This function is available in CD-HIT-DUP.</p>
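<p>Prefix checking for single and paired-end reads can be sketched as follows. This is a minimal illustration rather than the CD-HIT-DUP, MG-RAST or IMG/M implementation; the function names and the default prefix length of 20 bases are assumptions made for the example.</p>

```python
from collections import defaultdict


def prefix_duplicates(reads, prefix_len=20):
    """Group single reads (name, sequence) that share the same prefix."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq[:prefix_len]].append(name)
    # Only groups with more than one read contain duplicates.
    return [names for names in groups.values() if len(names) > 1]


def paired_prefix_duplicates(pairs, prefix_len=20):
    """Group read pairs (name, forward, reverse) by the prefixes of both ends."""
    groups = defaultdict(list)
    for name, fwd, rev in pairs:
        groups[(fwd[:prefix_len], rev[:prefix_len])].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

<p>Requiring both ends to match makes a chance collision between unrelated fragments less likely than with a single prefix.</p>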
<p>When removing duplicates, a question that needs to be considered is ‘are these duplicates all artificial’? The experimentally observed duplicated sequences also include natural duplicates, i.e. those that happen to be duplicates by chance. So, simply removing all duplicates may also cause an underestimation of abundance associated with natural duplicates. The CD-HIT-454 article investigated the occurrence of natural duplicates for different types of metagenomic samples and found that (i) the rate of natural duplicates highly correlates with the read density (the number of reads divided by genome size); (ii) for high-complexity metagenomic samples, natural duplicates make up a few percent of all duplicates and (iii) for viral metagenomic samples or metatranscriptomics, natural duplicates can be more abundant than artificial duplicates. These guidelines help to decide whether to remove or to keep duplicated reads in a metagenomic sample [
<xref ref-type="bibr" rid="bbs035-B13">13</xref>
].</p>
</sec>
<sec>
<title>DIVERSITY</title>
<p>Metagenomic projects (e.g. [
<xref ref-type="bibr" rid="bbs035-B9">9</xref>
,
<xref ref-type="bibr" rid="bbs035-B47">47</xref>
,
<xref ref-type="bibr" rid="bbs035-B48">48</xref>
]) often survey both genomic DNAs and 16S rRNAs. The latter are used to estimate the microbial diversity, which is often quantitatively described in OTUs. Because of read length limitation, it is not practical to sequence the full length of 16S rRNA (∼1.5 kb), so 16S rRNA studies often use individual variable regions (V1–V9) or sections that cover a few variable regions (e.g. V1–V3 and V3–V5). Pyrosequencing of 16S rRNA amplicons has been the dominant approach in rRNA studies. Finding OTUs from 16S rRNA tags can be readily addressed by clustering. Conventionally, tags with ≥97% identity are placed in the same OTUs at the species level. CD-HIT [
<xref ref-type="bibr" rid="bbs035-B29">29</xref>
] and DOTUR [
<xref ref-type="bibr" rid="bbs035-B49">49</xref>
] were often used for OTU clustering during early studies.</p>
<p>However, a big problem in OTU analysis is that directly clustering the raw rRNA reads or even the high-quality reads often greatly over-estimates the diversity. A recent review [
<xref ref-type="bibr" rid="bbs035-B50">50</xref>
] analyzed a list of methods and discussed solving this problem at the clustering algorithm level. This article suggested using average linkage-based hierarchical clustering methods such as ESPRIT [
<xref ref-type="bibr" rid="bbs035-B51">51</xref>
], instead of greedy incremental methods such as CD-HIT [
<xref ref-type="bibr" rid="bbs035-B29">29</xref>
] and Uclust [
<xref ref-type="bibr" rid="bbs035-B39">39</xref>
] for OTU clustering.</p>
<p>In the meantime, many other studies [
<xref ref-type="bibr" rid="bbs035-B52 bbs035-B53 bbs035-B54 bbs035-B55 bbs035-B56">52–56</xref>
] found that the single biggest cause of the over-estimation problem is the sequence errors or noise, so new methods such as SLP [
<xref ref-type="bibr" rid="bbs035-B52">52</xref>
], PyroNoise [
<xref ref-type="bibr" rid="bbs035-B54">54</xref>
], Denoiser [
<xref ref-type="bibr" rid="bbs035-B55">55</xref>
] and AmpliconNoise [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
] focus on identifying and removing sequence noise. All these methods find sequence errors by clustering analysis and are based on the principle that a high-abundance cluster can recruit small clusters and singletons, which have more sequence errors. SLP clusters the actual rRNA tags, and the rest of the methods cluster the original flowgram data. Currently, the best performing method among them is AmpliconNoise [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
], which has been benchmarked by several commonly used Mock data sets; these data sets are artificial mixtures of 16S rRNA clones at different abundance levels from a number of known species.</p>
<p>Although the speed of AmpliconNoise is considerably improved over its predecessor (PyroNoise), it is still computationally quite intensive. Recently, CD-HIT-OTU was introduced to the CD-HIT package. CD-HIT-OTU also uses a multi-step clustering method to remove reads with sequence errors and achieves results comparable with AmpliconNoise. However, as CD-HIT-OTU clusters sequences instead of flowgram data and inherits unique heuristics from CD-HIT, it is orders of magnitude faster than AmpliconNoise and other methods such as Denoiser.
<xref ref-type="table" rid="bbs035-T2">Table 2</xref>
lists the performance of CD-HIT-OTU, AmpliconNoise and Denoiser (implemented in QIIME [
<xref ref-type="bibr" rid="bbs035-B57">57</xref>
]) on clustering the Mock benchmark data sets [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
] at 97% identity level.
<table-wrap id="bbs035-T2" position="float">
<label>Table 2:</label>
<caption>
<p>Accuracy and speed for OTU identification
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>a</sup>
</xref>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Data
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>b</sup>
</xref>
</th>
<th rowspan="1" colspan="1">True OTUs
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>c</sup>
</xref>
</th>
<th colspan="12" rowspan="1">Number of predicted OTUs, sensitivity (%), specificity (%), CPU time (h, min, s)
<hr></hr>
</th>
</tr>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th colspan="4" rowspan="1">CD-HIT-OTU</th>
<th colspan="4" rowspan="1">AmpliconNoise</th>
<th colspan="4" rowspan="1">Denoiser</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Divergent</td>
<td rowspan="1" colspan="1">23</td>
<td rowspan="1" colspan="1">26</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">88</td>
<td rowspan="1" colspan="1">11 s</td>
<td rowspan="1" colspan="1">28</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">82</td>
<td rowspan="1" colspan="1">32 h</td>
<td rowspan="1" colspan="1">35</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">65</td>
<td rowspan="1" colspan="1">15 m</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Artificial</td>
<td rowspan="1" colspan="1">33</td>
<td rowspan="1" colspan="1">32</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">13 s</td>
<td rowspan="1" colspan="1">34</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">91</td>
<td rowspan="1" colspan="1">22 h</td>
<td rowspan="1" colspan="1">38</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">81</td>
<td rowspan="1" colspan="1">13 m</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Even1</td>
<td rowspan="1" colspan="1">53</td>
<td rowspan="1" colspan="1">71</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">74</td>
<td rowspan="1" colspan="1">8 s</td>
<td rowspan="1" colspan="1">85</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">62</td>
<td rowspan="1" colspan="1">68 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Even2</td>
<td rowspan="1" colspan="1">53</td>
<td rowspan="1" colspan="1">57</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">89</td>
<td rowspan="1" colspan="1">7 s</td>
<td rowspan="1" colspan="1">83</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">63</td>
<td rowspan="1" colspan="1">49 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Even3</td>
<td rowspan="1" colspan="1">52</td>
<td rowspan="1" colspan="1">60</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">86</td>
<td rowspan="1" colspan="1">7 s</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">57</td>
<td rowspan="1" colspan="1">65 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uneven1</td>
<td rowspan="1" colspan="1">49</td>
<td rowspan="1" colspan="1">56</td>
<td rowspan="1" colspan="1">91</td>
<td rowspan="1" colspan="1">80</td>
<td rowspan="1" colspan="1">5 s</td>
<td rowspan="1" colspan="1">76</td>
<td rowspan="1" colspan="1">97</td>
<td rowspan="1" colspan="1">63</td>
<td rowspan="1" colspan="1">39 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uneven2</td>
<td rowspan="1" colspan="1">41</td>
<td rowspan="1" colspan="1">45</td>
<td rowspan="1" colspan="1">85</td>
<td rowspan="1" colspan="1">77</td>
<td rowspan="1" colspan="1">7 s</td>
<td rowspan="1" colspan="1">67</td>
<td rowspan="1" colspan="1">95</td>
<td rowspan="1" colspan="1">58</td>
<td rowspan="1" colspan="1">35 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Uneven3</td>
<td rowspan="1" colspan="1">38</td>
<td rowspan="1" colspan="1">42</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">7 s</td>
<td rowspan="1" colspan="1">73</td>
<td rowspan="1" colspan="1">97</td>
<td rowspan="1" colspan="1">50</td>
<td rowspan="1" colspan="1">44 h</td>
<td colspan="4" rowspan="1">NA
<xref ref-type="table-fn" rid="bbs035-TF2">
<sup>d</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Titanium</td>
<td rowspan="1" colspan="1">69</td>
<td rowspan="1" colspan="1">69</td>
<td rowspan="1" colspan="1">98</td>
<td rowspan="1" colspan="1">98</td>
<td rowspan="1" colspan="1">7 s</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">76</td>
<td rowspan="1" colspan="1">388 h</td>
<td rowspan="1" colspan="1">146</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">47</td>
<td rowspan="1" colspan="1">6 h</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bbs035-TF2">
<p>
<sup>a</sup>
All data sets were downloaded from
<ext-link ext-link-type="uri" xlink:href="http://people.civil.gla.ac.uk/~quince/Data/AmpliconNoise.html">http://people.civil.gla.ac.uk/∼quince/Data/AmpliconNoise.html</ext-link>
according to an article [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
].
<sup>b</sup>
Parameters are based on each program’s default settings.
<sup>c</sup>
True OTUs were calculated by clustering the reference sequences that are covered by the raw reads.
<sup>d</sup>
Flowgram data are only available in AmpliconNoise-specific format, so we can run AmpliconNoise but not Denoiser. However, Denoiser’s performance for these data sets can be referenced from an article [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
].</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>CD-HIT-OTU has the following steps: (i) Raw reads with ambiguous base calls are removed. Reads are also removed if their 5′-ends do not match the user-provided primer sequence or a consensus built from the first k bases at the 5′-ends of all reads (
<italic>k</italic>
 = 6 by default, adjustable by users). For long reads, it also trims off the portion of the 3′-tails that extends beyond the median read length. (ii) Processed reads are clustered at 100% identity using CD-HIT-DUP. At this step, the reads from a unique rRNA template will form one large primary cluster (containing the error-free reads) and some small clusters, which contain reads with sequence errors. (iii) The representative sequences from step (ii) are sorted by abundance and then clustered by CD-HIT-EST at a threshold that allows up to two mismatches. For example, 200-bp reads are clustered at 99.0% identity, so that small clusters are recruited into their primary clusters. (iv) Let
<italic>x</italic>
be the median size of small clusters recruited into the most abundant primary cluster with two mismatches. Clusters smaller than
<italic>x</italic>
are dominated by reads with more than two errors from the most abundant template; so these clusters are removed. Herein,
<italic>x</italic>
is often very small (2 or 3), so that rare species will still be kept in the analysis. (v) The remaining representative sequences from step (ii) are clustered into OTUs using CD-HIT-EST (parameters: -c 0.97 -n 10 -l 11 -p 1 -d 0 -g 1). Herein, option ‘-c 0.97’ means 97% identity. (vi) The non-representative tags are recruited into the OTUs using CD-HIT-EST-2D (parameters: -c 0.97 -n 10 -l 11 -p 1 -d 0 -g 1).</p>
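<p>The noise-filtering core of this pipeline, i.e. the dereplication, abundance-sorted recruiting and median-based cutoff steps, can be sketched as follows. This is a simplified illustration and not the CD-HIT-OTU code: the function names are hypothetical, mismatches are counted position by position without gaps, and the fallback cutoff of 1 when nothing is recruited is our assumption.</p>

```python
from collections import Counter
from statistics import median


def mismatches(a, b):
    """Ungapped mismatch count, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def filter_noisy_clusters(reads, max_mismatch=2):
    """Return (x, survivors) where survivors maps unique reads to abundance."""
    # Dereplication: identical reads form one cluster whose size is its count.
    abundance = Counter(reads)
    # Visit unique sequences from most to least abundant; each primary
    # sequence recruits later sequences within `max_mismatch` mismatches.
    primaries = []   # list of (primary sequence, sizes of recruited clusters)
    for seq, _ in abundance.most_common():
        for primary, recruited in primaries:
            if max_mismatch >= mismatches(primary, seq):
                recruited.append(abundance[seq])
                break
        else:
            primaries.append((seq, []))
    # x is the median size of the small clusters recruited by the most
    # abundant primary cluster (assumed to be 1 if it recruited nothing).
    top_recruited = primaries[0][1]
    x = median(top_recruited) if top_recruited else 1
    # Clusters smaller than x are treated as noise and dropped.
    survivors = {seq: n for seq, n in abundance.items() if n >= x}
    return x, survivors
```

<p>The surviving unique sequences would then be clustered at 97% identity to form OTUs, and the remaining tags recruited into them.</p>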
<p>The ultra-high speed of CD-HIT-OTU allows clustering multi-million rRNA tags pooled from a series of related samples. Such clustering can significantly increase the accuracy of OTU identification, because tags shared by different samples validate each other. Clustering pooled samples may identify very rare OTUs, which may be missed if individual samples are processed independently. We applied CD-HIT-OTU on two pooled data sets, Human_gut_V6 [
<xref ref-type="bibr" rid="bbs035-B48">48</xref>
] and Human_body_V2 [
<xref ref-type="bibr" rid="bbs035-B44">44</xref>
]; these include 33 gut samples from obese and lean twin families and 815 samples from different body sites, respectively (
<xref ref-type="table" rid="bbs035-T3">Table 3</xref>
). CD-HIT-OTU only used a few minutes for these two data sets.
<table-wrap id="bbs035-T3" position="float">
<label>Table 3:</label>
<caption>
<p>OTU analysis for pooled human gut and human body samples</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Data set
<xref ref-type="table-fn" rid="bbs035-TF9">
<sup>a</sup>
</xref>
</th>
<th rowspan="1" colspan="1">Reads</th>
<th rowspan="1" colspan="1">Region</th>
<th rowspan="1" colspan="1">Platform</th>
<th rowspan="1" colspan="1">OTUs</th>
<th rowspan="1" colspan="1">CPU (s)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Human_gut</td>
<td rowspan="1" colspan="1">817942</td>
<td rowspan="1" colspan="1">V6</td>
<td rowspan="1" colspan="1">GS 20</td>
<td rowspan="1" colspan="1">317</td>
<td rowspan="1" colspan="1">37</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Human_body</td>
<td rowspan="1" colspan="1">1071335</td>
<td rowspan="1" colspan="1">V2</td>
<td rowspan="1" colspan="1">GS FLX</td>
<td rowspan="1" colspan="1">238</td>
<td rowspan="1" colspan="1">295</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bbs035-TF9">
<p>
<sup>a</sup>
The Human_gut data set was downloaded from
<ext-link ext-link-type="uri" xlink:href="http://gordonlab.wustl.edu/NatureTwins_2008/TurnbaughNature_11_30_08.html">http://gordonlab.wustl.edu/NatureTwins_2008/TurnbaughNature_11_30_08.html</ext-link>
. The Human_body data set was kindly provided by the authors from Ref. [
<xref ref-type="bibr" rid="bbs035-B44">44</xref>
].</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>In this analysis, we found that clustering the pooled samples identified 19–80 more rare OTUs than clustering individual samples for the 33 human gut data sets. For the 815 human body data sets, clustering pooled samples found up to 50 more rare OTUs. Clustering pooled samples also provides a very straightforward way to define a ‘core microbiome’ and to compare the diversity and composition of samples. For example, we calculated NAT50 for each sample. Herein, NAT50 is a diversity indicator we defined, which stands for the number of most abundant taxonomic groups covering 50% of the population.
<xref ref-type="fig" rid="bbs035-F1">Figure 1</xref>
shows that obese samples have less diversity than lean samples. Please note that the abundance of OTUs is the abundance of rRNA genes and may not be the abundance of species, because the rRNA copy numbers are unknown. However, rRNA gene abundance largely correlates with species abundance. The full results for human gut and human body are also available as examples with the CD-HIT-OTU software, which is available from
<ext-link ext-link-type="uri" xlink:href="http://weizhongli-lab.org/cd-hit-otu">http://weizhongli-lab.org/cd-hit-otu</ext-link>
. CD-HIT-OTU is also available as a web server within WebMGA [
<xref ref-type="bibr" rid="bbs035-B43">43</xref>
], a collection of web servers for metagenomic data analysis.
<fig id="bbs035-F1" position="float">
<label>Figure 1:</label>
<caption>
<p>Distribution of microbial diversity measured by NATs (NAT20, NAT50, NAT80 and NAT99) for 33 human gut samples. The
<italic>x</italic>
-axis is NAT category. The
<italic>y</italic>
-axis is NAT value. Samples are colored by sample type (obese, overweight or lean). The results show that obese samples have a lower average NAT50 than the lean samples.</p>
</caption>
<graphic xlink:href="bbs035f1"></graphic>
</fig>
</p>
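<p>The NAT indicator defined above can be sketched in a few lines. This is a minimal illustration, not the CD-HIT-OTU implementation: the function name <italic>nat</italic> and the OTU counts are hypothetical, and the percentage is handled with integer arithmetic to avoid rounding effects.</p>
<preformat><![CDATA[
```python
def nat(abundances, percent):
    """Number of most abundant groups needed to cover `percent` % of all reads."""
    total = sum(abundances)
    if total == 0:
        return 0
    covered = 0
    # Walk through the groups from most to least abundant.
    for count, size in enumerate(sorted(abundances, reverse=True), start=1):
        covered += size
        if covered * 100 >= percent * total:
            return count
    return len(abundances)


# Hypothetical OTU read counts for one sample (not taken from the article).
sample = [500, 200, 100, 80, 60, 30, 20, 10]
nat_values = {k: nat(sample, k) for k in (20, 50, 80, 99)}
print(nat_values)
```
]]></preformat>
<p>A sample dominated by a few OTUs reaches the 50% mark after very few groups, so a low NAT50 corresponds to low diversity.</p>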
</sec>
<sec>
<title>FILTERING SEQUENCE ERRORS</title>
<p>As shown in the previous section, clustering-based approaches address sequencing errors in rRNA tags very well. Similar clustering analyses can also filter out errors in genomic and metagenomic reads and, therefore, improve sequence assembly, gene prediction and other analyses. However, finding errors in genomic reads is more difficult than in rRNA tags, which can be aligned at their 5′-ends because they all start with the same universal primers. For genomic reads, several existing methods detect sequence errors with various clustering approaches. For example, FreClu [
<xref ref-type="bibr" rid="bbs035-B58">58</xref>
] and EDAR [
<xref ref-type="bibr" rid="bbs035-B59">59</xref>
] use k-mer frequency; Hammer [
<xref ref-type="bibr" rid="bbs035-B60">60</xref>
] uses a Hamming graph and ECHO [
<xref ref-type="bibr" rid="bbs035-B61">61</xref>
] clusters overlapping reads through k-mer hashing. These methods avoid very time-consuming full-length sequence alignment in clustering the reads. However, full-length sequence alignment is feasible using ultra-fast sequence clustering algorithms. For example, the analysis in the SEED article [
<xref ref-type="bibr" rid="bbs035-B41">41</xref>
] shows that genome assembly can be notably improved by only assembling cluster representatives.</p>
<p>Herein, we show an example using clustering-based filtering to improve metagenome assembly. Metagenomic samples often contain a small number of dominant organisms along with hundreds or more less abundant species. Because of sequencing errors, major problems in metagenome assembly often occur for the high-abundance species. Clustering methods, including the k-mer frequency-based approaches, benefit from high sequence redundancy, from which a better consensus can be derived. So, the assembly difficulties for the dominant species in a metagenome can be effectively alleviated.</p>
<p>Herein, as a demonstration, we use an Illumina data set representing the high-abundance species of a human gut sample (MH0006) from MetaHIT project [
<xref ref-type="bibr" rid="bbs035-B9">9</xref>
] at
<ext-link ext-link-type="uri" xlink:href="http://gutmeta.genomics.org.cn">http://gutmeta.genomics.org.cn</ext-link>
. The MetaHIT study also provided Sanger reads for sample MH0006 as reference and assembled them into 995 contigs. We mapped the Illumina reads to these reference contigs using SOAP2 [
<xref ref-type="bibr" rid="bbs035-B62">62</xref>
] (option: -M 4); 144 contigs have a coverage of at least 200. The reads mapped to these contigs were selected as a high-abundance subset, which contains 36 175 286 paired-end reads of 75 bp. The clustering-based approach has the following steps: (i) reads were clustered with CD-HIT-EST (options: ‘-c 0.96 -n 10 -r 1 -aS 0.5 -b 2 -G 0’); (ii) for each cluster, we kept at most
<italic>N</italic>
reads that have the best average quality score per base and filtered out the extra sequences, where
<italic>N</italic>
is a redundancy cutoff parameter and (iii) the remaining reads were assembled at different
<italic>N</italic>
levels, and optimal assemblies were achieved. The comparison of contigs between original reads and filtered reads using Velvet [
<xref ref-type="bibr" rid="bbs035-B63">63</xref>
] and SOAPdenovo [
<xref ref-type="bibr" rid="bbs035-B64">64</xref>
] is shown in
<xref ref-type="fig" rid="bbs035-F2">Figure 2</xref>
. The filtered data sets greatly improve the N50 and the longest contig. In fact, for the unfiltered data set, the coverage is so low that there is no valid N50. The accuracy and coverage are also much higher with the filtered reads.
<fig id="bbs035-F2" position="float">
<label>Figure 2:</label>
<caption>
<p>Assembly performance of the filtered reads for metagenomic sample MH0006. The
<italic>x</italic>
-axis is the redundancy cutoff
<italic>N</italic>
. The length of the longest contig (kb) and N50 (kb) are plotted against the left
<italic>y</italic>
-axis. The accuracy and genome coverage are plotted against the right
<italic>y</italic>
-axis. The assembly results for original reads are at the far right, marked as ‘ALL’ on the
<italic>x</italic>
-axis. The accuracy of contigs is the total length of correct contigs divided by the total length of all contigs. The genome coverage is the fraction of reference genome covered by the correct contigs.</p>
</caption>
<graphic xlink:href="bbs035f2"></graphic>
</fig>
</p>
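<p>Step (ii) of the procedure above, keeping at most <italic>N</italic> reads per cluster ranked by average quality score per base, can be sketched as follows. This is only an illustration under stated assumptions: the clusters are assumed to be already available as lists of read identifiers (for example parsed from CD-HIT-EST output), and the names <italic>filter_clusters</italic>, <italic>mean_quality</italic> and all example data are hypothetical.</p>
<preformat><![CDATA[
```python
def mean_quality(scores):
    """Average quality score per base for one read."""
    return sum(scores) / len(scores)


def filter_clusters(clusters, quality, n):
    """Keep at most `n` reads per cluster, preferring the best average quality.

    clusters: dict mapping a cluster id to a list of read ids
    quality:  dict mapping a read id to its list of per-base quality scores
    """
    kept = []
    for members in clusters.values():
        ranked = sorted(members, key=lambda read: mean_quality(quality[read]),
                        reverse=True)
        kept.extend(ranked[:n])  # reads beyond the redundancy cutoff are dropped
    return kept


# Hypothetical toy input: one redundant cluster and one singleton.
clusters = {0: ["r1", "r2", "r3"], 1: ["r4"]}
quality = {
    "r1": [30, 30, 30],
    "r2": [40, 38, 36],
    "r3": [20, 25, 30],
    "r4": [10, 10, 10],
}
kept_reads = filter_clusters(clusters, quality, n=2)
print(kept_reads)
```
]]></preformat>
<p>Only highly redundant clusters lose reads, which is why the filtering mainly affects the high-abundance species; the surviving reads would then be passed to an assembler at each tested value of <italic>N</italic>.</p>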
<p>There are two reasons why we used the high-abundance subset instead of the full MH0006 data. First, sequence errors deteriorate sequence assembly for high-abundance reads, so our filtering method only improves the assembly of the high-abundance species. Second, we evaluated the contigs assembled from Illumina reads by comparing them with high-quality references (contigs from Sanger reads). Most contigs assembled from the low-abundance Illumina reads cannot be mapped to any reference sequence, so we cannot evaluate these contigs.</p>
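<p>For readers less familiar with the assembly metrics reported in this section, the sketch below computes N50 and the contig accuracy. The accuracy follows the definition in the Figure 2 caption (total length of correct contigs divided by total length of all contigs). The N50 shown is the common contig-set definition, which is an assumption here because the article does not spell out how its N50 was computed; the contig list is hypothetical.</p>
<preformat><![CDATA[
```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def accuracy(contigs):
    """Total length of correct contigs divided by total length of all contigs.

    contigs: list of (length, is_correct) pairs
    """
    total = sum(length for length, _ in contigs)
    correct = sum(length for length, ok in contigs if ok)
    return correct / total if total else 0.0


# Hypothetical contigs: (length in bp, whether it matches the reference).
contigs = [(100, True), (80, True), (60, False), (40, True), (20, True)]
contig_n50 = n50([length for length, _ in contigs])
contig_accuracy = accuracy(contigs)
print(contig_n50, contig_accuracy)
```
]]></preformat>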
</sec>
<sec>
<title>DATABASE SEARCH</title>
<p>In metagenomic projects, an important annotation step is to query the reads or Open Reading Frames (ORF) against reference databases of known genomes, DNAs or proteins with an alignment program such as basic local alignment search tool (BLAST) [
<xref ref-type="bibr" rid="bbs035-B27">27</xref>
], BWA [
<xref ref-type="bibr" rid="bbs035-B65">65</xref>
], BLAT [
<xref ref-type="bibr" rid="bbs035-B66">66</xref>
], FR-HIT [
<xref ref-type="bibr" rid="bbs035-B67">67</xref>
] or Rapsearch [
<xref ref-type="bibr" rid="bbs035-B68">68</xref>
]. Because of the huge size of both reference databases and the query, such database searches can be very time consuming. However, both reference databases and the query sample can be very redundant, so simply using NR data sets may save a great deal of computation time and, in some cases, also improve the accuracy of database search [
<xref ref-type="bibr" rid="bbs035-B69">69</xref>
].</p>
<p>As illustrated in
<xref ref-type="fig" rid="bbs035-F3">Figure 3</xref>
, before database searching, both the reference database and the query are clustered at certain similarity thresholds. Then, the NR sequences (i.e. representatives) from the query are aligned to the NR sequences in the reference database. Finally, the annotation results are copied from the representatives to other sequences in the same clusters.
<fig id="bbs035-F3" position="float">
<label>Figure 3:</label>
<caption>
<p>Using NR query and NR reference database for metagenome annotation.</p>
</caption>
<graphic xlink:href="bbs035f3"></graphic>
</fig>
</p>
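<p>The last step of this workflow, copying annotations from representatives to the other members of their clusters, can be sketched as below. The parser assumes the usual layout of a CD-HIT <italic>.clstr</italic> file (a ‘>Cluster’ header followed by member lines in which the representative is marked with ‘*’); the function names, read identifiers and the annotation label are hypothetical.</p>
<preformat><![CDATA[
```python
import re

# Matches the sequence name in a member line and an optional '*' marker.
MEMBER = re.compile(r">(\S+?)\.\.\.\s*(\*)?")


def parse_clstr(lines):
    """Return {representative: [all members]} from .clstr-style lines."""
    clusters = []  # list of [representative, members]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([None, []])
            continue
        match = MEMBER.search(line)
        if match is None or not clusters:
            continue
        name = match.group(1)
        clusters[-1][1].append(name)
        if match.group(2):  # '*' marks the representative sequence
            clusters[-1][0] = name
    return {rep: members for rep, members in clusters if rep is not None}


def propagate(clusters, rep_annotation):
    """Copy each representative's annotation to every member of its cluster."""
    annotation = {}
    for rep, members in clusters.items():
        label = rep_annotation.get(rep)
        if label is None:
            continue  # representative had no database hit
        for member in members:
            annotation[member] = label
    return annotation


# Hypothetical cluster file content and search result.
clstr_lines = [
    ">Cluster 0",
    "0\t75nt, >read_a... *",
    "1\t75nt, >read_b... at +/98.67%",
    ">Cluster 1",
    "0\t75nt, >read_c... *",
]
clusters = parse_clstr(clstr_lines)
annotation = propagate(clusters, {"read_a": "example_function"})
print(annotation)
```
]]></preformat>
<p>Only the representatives are searched against the database, so the cost of the search depends on the number of clusters rather than the number of reads.</p>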
<p>A major concern with this approach is how much the annotations calculated from the NR data sets differ from those calculated from the original full data sets. The key is to use appropriate, conservative clustering parameters, such that the clusters are homogeneous, as required by the annotation goals. For example, suppose that 97% is used as the identity cutoff for 16S rRNAs at the species level, and the goal is a species-level taxonomy annotation of 16S rRNA reads obtained by aligning them to a reference rRNA database such as Silva [
<xref ref-type="bibr" rid="bbs035-B70">70</xref>
], RDP [
<xref ref-type="bibr" rid="bbs035-B71">71</xref>
] and Greengene [
<xref ref-type="bibr" rid="bbs035-B72">72</xref>
]. Then, the query and references should be clustered at a similarity cutoff greater than 97%. If the goal is to annotate ORFs using the KEGG database [
<xref ref-type="bibr" rid="bbs035-B73">73</xref>
], then clustering both KEGG reference sequences and the ORFs at 90% will be harmless, because sequences sharing 90% identity will rarely belong to different KEGG orthology groups.</p>
<p>To further reduce the annotation difference between NR data sets and full data sets, this approach should only be used to cluster sequences of similar length and with enough overlapping regions (
<xref ref-type="fig" rid="bbs035-F3">Figure 3</xref>
D), instead of other clustering settings (
<xref ref-type="fig" rid="bbs035-F3">Figure 3</xref>
B, C). The CD-HIT program has many parameters such as sequence length, alignment length and alignment coverage for users to finely tune the clustering process to form more homogeneous clusters.</p>
<p>
<xref ref-type="table" rid="bbs035-T4">Table 4</xref>
lists the clustering results for commonly used reference databases in metagenomic studies at conservative thresholds. Herein, the sizes of the NR data sets are 28–58% of the original ones. After clustering, the size of an NR query data set, which highly depends on the sequencing depth, can range from about half the size of the original data set to many times smaller. So overall, annotation using NR data sets can easily be accelerated 10-fold.
<table-wrap id="bbs035-T4" position="float">
<label>Table 4:</label>
<caption>
<p>Clustering results of reference databases by CD-HIT package</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Data set
<xref ref-type="table-fn" rid="bbs035-TF3">
<sup>a</sup>
</xref>
</th>
<th rowspan="1" colspan="1">Number of sequences</th>
<th rowspan="1" colspan="1">Total size</th>
<th rowspan="1" colspan="1">Cutoff
<xref ref-type="table-fn" rid="bbs035-TF3">
<sup>b</sup>
</xref>
(%)</th>
<th rowspan="1" colspan="1">Clusters</th>
<th rowspan="1" colspan="1">Reduced to (%)</th>
<th rowspan="1" colspan="1">Time (minutes)
<xref ref-type="table-fn" rid="bbs035-TF3">
<sup>c</sup>
</xref>
</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">NCBI NR</td>
<td rowspan="1" colspan="1">12 054 819</td>
<td rowspan="1" colspan="1">4.3 GB</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">7 036 029</td>
<td rowspan="1" colspan="1">58</td>
<td rowspan="1" colspan="1">181</td>
</tr>
<tr>
<td rowspan="1" colspan="1">16S (Silva + Greengene)</td>
<td rowspan="1" colspan="1">555 530</td>
<td rowspan="1" colspan="1">799 MB</td>
<td rowspan="1" colspan="1">98</td>
<td rowspan="1" colspan="1">154 170</td>
<td rowspan="1" colspan="1">28</td>
<td rowspan="1" colspan="1">90</td>
</tr>
<tr>
<td rowspan="1" colspan="1">NCBI microbial genomes</td>
<td rowspan="1" colspan="1">3 355</td>
<td rowspan="1" colspan="1">6.4 GB</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">1 279</td>
<td rowspan="1" colspan="1">38</td>
<td rowspan="1" colspan="1">389</td>
</tr>
<tr>
<td rowspan="1" colspan="1">NCBI virus sequences</td>
<td rowspan="1" colspan="1">1 042 347</td>
<td rowspan="1" colspan="1">1.3 GB</td>
<td rowspan="1" colspan="1">95</td>
<td rowspan="1" colspan="1">288 701</td>
<td rowspan="1" colspan="1">28</td>
<td rowspan="1" colspan="1">480</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bbs035-TF3">
<p>
<sup>a</sup>
NCBI NR was downloaded from NCBI at
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/blast/db/FASTA/">ftp://ftp.ncbi.nih.gov/blast/db/FASTA/</ext-link>
. 16S sequences from Silva and Greengene were downloaded from
<ext-link ext-link-type="uri" xlink:href="http://www.arb-silva.de/download/archive/">http://www.arb-silva.de/download/archive/</ext-link>
and
<ext-link ext-link-type="uri" xlink:href="http://greengenes.lbl.gov/Download/Sequence_Data/">http://greengenes.lbl.gov/Download/Sequence_Data/</ext-link>
, respectively. NCBI microbial genomes were downloaded from
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/genomes/Bacteria/">ftp://ftp.ncbi.nih.gov/genomes/Bacteria/</ext-link>
(file: all.fna.tar.gz). NCBI virus sequences were kindly provided by the CAMERA project (Ref. 42).
<sup>b</sup>
Parameters for NR and 16S rRNA are ‘-c 0.9 -n 5 -g 1 -M 0 -T 0’ and ‘-c 0.98 -n 11 -b 5 -M 0 -T 0 -G 1’, respectively. NCBI microbial genomes and virus sequences are clustered by a beta version of CD-HIT that can process very long sequences with parameter ‘-c 0.9’ and ‘-c 0.95’.
<sup>c</sup>
Time on a computer with eight cores.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
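<p>The ‘Reduced to (%)’ column of Table 4 and the rough size of the expected acceleration can be checked with a short calculation. The counts below are copied from Table 4. The speed-up estimate rests on a simplifying assumption that is not stated in the article, namely that search cost scales with the product of query size and reference size; the function name <italic>speedup</italic> and the example query fractions are hypothetical.</p>
<preformat><![CDATA[
```python
# (number of sequences, number of clusters) as listed in Table 4.
table4 = {
    "NCBI NR": (12054819, 7036029),
    "16S (Silva + Greengene)": (555530, 154170),
    "NCBI microbial genomes": (3355, 1279),
    "NCBI virus sequences": (1042347, 288701),
}

# Percentage of sequences remaining after clustering.
reduced_to = {
    name: round(100 * clusters / total)
    for name, (total, clusters) in table4.items()
}


def speedup(reference_fraction, query_fraction):
    """Estimated acceleration if cost is proportional to query size x reference size."""
    return 1.0 / (reference_fraction * query_fraction)


print(reduced_to)
# Assumed query reductions of 50% and 20% (illustrative values only).
print(round(speedup(0.58, 0.5), 1), round(speedup(0.28, 0.2), 1))
```
]]></preformat>
<p>Under this assumption, a reference reduced to 28% combined with a query reduced to a fifth of its size already gives well over a 10-fold reduction in search space, while the mildest case in the table (58% and 50%) gives roughly 3-fold.</p>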
</sec>
<sec>
<title>PROTEIN FAMILY IDENTIFICATION</title>
<p>Reference-based metagenome annotation by comparison with known sequences is essential but has drawbacks, the biggest limitation being the inability to annotate novel sequences. Large metagenomes and those from under-explored environments contain a large number of novel genes, which might be specific to the environment. These novel proteins can easily be overlooked by reference-based annotation.</p>
<p>Clustering analysis is the most effective way to discover novel gene families from large data sets. This has been demonstrated by the global ocean sampling (GOS) study, which identified 3995 novel protein clusters from 17.4 million ORFs [
<xref ref-type="bibr" rid="bbs035-B16">16</xref>
]. Other large-scale studies, such as MetaHIT [
<xref ref-type="bibr" rid="bbs035-B9">9</xref>
], also found novel gene families through sequence clustering.</p>
<p>Clustering metagenomic proteins into families is more complicated than creating an NR data set. In both the GOS and MetaHIT projects, the analyses started by removing highly similar sequences (95–98% identity), followed by several steps of protein clustering. In GOS, these steps include (i) calculation of all-against-all similarities using BLAST; (ii) construction of core sequence clusters, which are dense sub-graphs in the whole graph where the vertices are sequences and the edges are defined by a set of very strong similarity cutoffs; (iii) calculation of sequence profiles for large core clusters using FFAS [
<xref ref-type="bibr" rid="bbs035-B74">74</xref>
] and PSI-BLAST [
<xref ref-type="bibr" rid="bbs035-B27">27</xref>
]; (iv) creation of protein families by merging core clusters using FFAS profiles and (v) recruitment of small clusters and singletons into large core clusters using PSI-BLAST profiles. In MetaHIT, families were clustered from all-against-all BLAST results with an algorithm called MCL, whose details were not described in the MetaHIT article.</p>
<p>The above clustering pipelines require very time-consuming BLAST calculation (e.g. GOS used 1 million CPU hours), so they are not very feasible for small labs, which now also generate large-scale metagenomic data by using NGS technologies. Earlier, the speed of the GOS clustering pipeline was improved [
<xref ref-type="bibr" rid="bbs035-B75">75</xref>
] by adopting CD-HIT as a fast clustering and recruiting tool. An independent study using only CD-HIT to build protein families from GOS proteins was also introduced later [
<xref ref-type="bibr" rid="bbs035-B15">15</xref>
]. This CD-HIT-based clustering produced comparable results to the original GOS study but only used ∼10 000 CPU hours. Thus, this approach is more suitable for those projects with large data sets that lack the computation resources for exploring protein families.</p>
<p>The above CD-HIT-based protein family finding process has three clustering steps where each subsequent clustering uses the representative sequences generated in the previous step. The similarity thresholds for these three clustering steps are 90%, 60% and 30%, respectively. The first two steps perform regular CD-HIT, and the last step uses PSI-CD-HIT, which also allows an alternative
<italic>e</italic>
value threshold. The details of this method are described in Refs. [
<xref ref-type="bibr" rid="bbs035-B15">15</xref>
] and [
<xref ref-type="bibr" rid="bbs035-B76">76</xref>
], where this method was further improved. The GOS data set is available from CAMERA project at
<ext-link ext-link-type="uri" xlink:href="http://camera.calit2.net">http://camera.calit2.net</ext-link>
. Herein, we demonstrate this easy approach for protein family identification using MetaHIT data. The original 14 792 886 proteins were downloaded from the MetaHIT project at
<ext-link ext-link-type="uri" xlink:href="http://gutmeta.genomics.org.cn/">http://gutmeta.genomics.org.cn/</ext-link>
. These proteins were hierarchically clustered at 90%, 80%, 60% and 30% identity or an
<italic>e</italic>
value of 1e-6. These four steps cost 10, 2, 102 and 720 CPU hours and yielded 3 076 514; 2 471 148; 1 554 866 and 732 063 clusters, respectively. We used a 4-step clustering for the MetaHIT data set, instead of the 3-step clustering we used earlier on the GOS data set, because the 4-step clustering provides better classification accuracy. The cluster distributions of MetaHIT are illustrated along with GOS clusters produced in Ref. [
<xref ref-type="bibr" rid="bbs035-B15">15</xref>
] (
<xref ref-type="fig" rid="bbs035-F4">Figure 4</xref>
A, B). Compared with GOS ORFs, more than half of which are spurious owing to the six-reading-frame translation, the MetaHIT genes predicted using Metagene [
<xref ref-type="bibr" rid="bbs035-B77">77</xref>
] have far fewer false ORFs. So, using similar clustering parameters, the MetaHIT data set has far fewer small clusters and singletons than GOS (
<xref ref-type="fig" rid="bbs035-F4">Figure 4</xref>
A). Cluster distributions for MetaHIT clusters grouped by known and novel are shown in
<xref ref-type="fig" rid="bbs035-F4">Figure 4</xref>
C and D. Herein, novel clusters are those clusters with no detectable similarity to Pfam families [
<xref ref-type="bibr" rid="bbs035-B78">78</xref>
] using HMMER3 [
<xref ref-type="bibr" rid="bbs035-B79">79</xref>
]. The 732 063 MetaHIT clusters contain 20 328 large clusters with at least 20 NR proteins, and 2580 of them are novel (
<xref ref-type="fig" rid="bbs035-F4">Figure 4</xref>
C), which cover ∼9% of all MetaHIT sequences (
<xref ref-type="fig" rid="bbs035-F4">Figure 4</xref>
D). These novel clusters may represent human gut-specific gene families and should be further investigated.
<fig id="bbs035-F4" position="float">
<label>Figure 4:</label>
<caption>
<p>Distribution of GOS and MetaHIT protein clusters. The
<italic>x</italic>
-axis is the cluster size
<italic>X</italic>
. The
<italic>y</italic>
-axis in left figures is the number of clusters of size at least
<italic>X</italic>
; the
<italic>y</italic>
-axis in right figures is the percentage of total sequences included in the clusters of size at least
<italic>X</italic>
. Graphs in (
<bold>A</bold>
) and (
<bold>B</bold>
) are for all GOS and MetaHIT sequences. Graphs in (
<bold>C</bold>
) and (
<bold>D</bold>
) are only for MetaHIT sequences, grouped by Known and Novel clusters. In addition, two separate lines are made for NR sequences (i.e. the 3 076 514 representative sequences clustered at 90% identity).</p>
</caption>
<graphic xlink:href="bbs035f4"></graphic>
</fig>
</p>
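<p>Because each clustering step only sees the representatives of the previous step, the final family of an original protein is found by following its representative through all levels. The sketch below illustrates this chaining; it is not part of the CD-HIT package, and the function name <italic>chain_levels</italic> and the protein identifiers are hypothetical. The CPU hours are those reported above for the four MetaHIT steps.</p>
<preformat><![CDATA[
```python
def chain_levels(levels):
    """Map every original sequence to its representative after the last step.

    levels: list of dicts, one per clustering step, each mapping a sequence id
            to its representative at that step. Representatives map to
            themselves and form the input of the next step.
    """
    final = {}
    for seq in levels[0]:
        rep = seq
        for level in levels:
            rep = level[rep]  # follow the representative into the next level
        final[seq] = rep
    return final


# Hypothetical three-step example (e.g. 90%, 60% and 30% identity).
step_90 = {"p1": "p1", "p2": "p1", "p3": "p3", "p4": "p4", "p5": "p5"}
step_60 = {"p1": "p1", "p3": "p1", "p4": "p4", "p5": "p5"}
step_30 = {"p1": "p1", "p4": "p1", "p5": "p5"}
families = chain_levels([step_90, step_60, step_30])
print(families)

# CPU hours of the four MetaHIT clustering steps reported in the text.
total_cpu_hours = sum([10, 2, 102, 720])
print(total_cpu_hours)
```
]]></preformat>
<p>Each step works on a much smaller input than the one before, which keeps the expensive low-identity step tractable.</p>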
</sec>
<sec>
<title>LIST OF TOOLS AND THEIR ALGORITHM CHARACTERISTICS</title>
<p>As a summary, the tools tested in this article are listed in
<xref ref-type="table" rid="bbs035-T5">Table 5</xref>
.
<table-wrap id="bbs035-T5" position="float">
<label>Table 5:</label>
<caption>
<p>A list of clustering tools for metagenomic sequence analysis used in this study</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Tool and reference</th>
<th rowspan="1" colspan="1">Description</th>
<th rowspan="1" colspan="1">Key parameters</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="2" colspan="1">CD-HIT [
<xref ref-type="bibr" rid="bbs035-B28 bbs035-B29 bbs035-B30 bbs035-B31">28–31</xref>
]</td>
<td rowspan="2" colspan="1">Cluster protein sequences</td>
<td rowspan="1" colspan="1">-c identity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-n word size</td>
</tr>
<tr>
<td rowspan="2" colspan="1">CD-HIT-EST [
<xref ref-type="bibr" rid="bbs035-B28 bbs035-B29 bbs035-B30 bbs035-B31">28–31</xref>
]</td>
<td rowspan="2" colspan="1">Cluster nucleotide sequences</td>
<td rowspan="1" colspan="1">-c identity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-n word size</td>
</tr>
<tr>
<td rowspan="2" colspan="1">Uclust [
<xref ref-type="bibr" rid="bbs035-B39">39</xref>
]</td>
<td rowspan="2" colspan="1">Cluster protein or nucleotide sequences</td>
<td rowspan="1" colspan="1">-id identity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-w word size</td>
</tr>
<tr>
<td rowspan="1" colspan="1">SEED [
<xref ref-type="bibr" rid="bbs035-B41">41</xref>
]</td>
<td rowspan="1" colspan="1">Cluster highly similar Illumina reads (up to 3 mismatches and overhanging bases)</td>
<td rowspan="1" colspan="1">--mismatch allowed mismatches</td>
</tr>
<tr>
<td rowspan="2" colspan="1">DNACLUST [
<xref ref-type="bibr" rid="bbs035-B40">40</xref>
]</td>
<td rowspan="2" colspan="1">Cluster highly similar DNA sequences (e.g. 16S rRNAs)</td>
<td rowspan="1" colspan="1">-s similarity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-k word size</td>
</tr>
<tr>
<td rowspan="1" colspan="1">CD-HIT-454 [
<xref ref-type="bibr" rid="bbs035-B13">13</xref>
]</td>
<td rowspan="1" colspan="1">Identify duplicates for 454 reads</td>
<td rowspan="1" colspan="1">-c identity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">CD-HIT-DUP</td>
<td rowspan="1" colspan="1">Identify duplicates for single or paired-end Illumina reads</td>
<td rowspan="1" colspan="1">-e allowed mismatches</td>
</tr>
<tr>
<td rowspan="2" colspan="1">CD-HIT-LAP</td>
<td rowspan="2" colspan="1">Identify overlapping Illumina reads</td>
<td rowspan="1" colspan="1">-m overlapping length</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-p overlapping coverage</td>
</tr>
<tr>
<td rowspan="2" colspan="1">PSI-CD-HIT [
<xref ref-type="bibr" rid="bbs035-B28 bbs035-B29 bbs035-B30 bbs035-B31">28–31</xref>
]</td>
<td rowspan="2" colspan="1">Cluster proteins at low identity cutoff (20–50%)</td>
<td rowspan="1" colspan="1">-c identity cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">-ce expect value cutoff</td>
</tr>
<tr>
<td rowspan="1" colspan="1">CD-HIT-OTU</td>
<td rowspan="1" colspan="1">Identify operational taxonomic units (OTUs) from rRNAs</td>
<td rowspan="1" colspan="1">Identity cutoff
<xref ref-type="table-fn" rid="bbs035-TF13">
<sup>a</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">AmpliconNoise [
<xref ref-type="bibr" rid="bbs035-B56">56</xref>
]</td>
<td rowspan="1" colspan="1">Cluster flowgram data to remove noises from reads for OTU clustering</td>
<td rowspan="1" colspan="1">Identity cutoff
<xref ref-type="table-fn" rid="bbs035-TF13">
<sup>a</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Denoiser [
<xref ref-type="bibr" rid="bbs035-B55">55</xref>
]</td>
<td rowspan="1" colspan="1">Cluster flowgram data to remove noises from reads for OTU clustering</td>
<td rowspan="1" colspan="1">Identity cutoff
<xref ref-type="table-fn" rid="bbs035-TF13">
<sup>a</sup>
</xref>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Cluster-based filtering</td>
<td rowspan="1" colspan="1">Filter sequence errors for improved sequence assembly</td>
<td rowspan="1" colspan="1">See CD-HIT-EST</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Protein family clustering [
<xref ref-type="bibr" rid="bbs035-B15">15</xref>
]</td>
<td rowspan="1" colspan="1">Identify protein families from metagenomic sequences</td>
<td rowspan="1" colspan="1">See CD-HIT and PSI-CD-HIT</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="bbs035-TF13">
<p>
<sup>a</sup>
CD-HIT-OTU, AmpliconNoise and Denoiser have multiple steps involving many parameters, which usually do not need to be modified.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>The key parameters of these programs include the clustering similarity cutoff and some algorithmic parameters. For the clustering similarity cutoff, CD-HIT, CD-HIT-EST, Uclust, CD-HIT-454, PSI-CD-HIT and the OTU clustering packages use sequence identity; DNACLUST uses a similarity cutoff based on edit distance, which is very similar to sequence identity; SEED and CD-HIT-DUP allow a certain number of mismatches. Word size is the most important algorithmic parameter for many programs (
<xref ref-type="table" rid="bbs035-T5">Table 5</xref>
). The choice of word size depends on the clustering similarity cutoff and the type of sequences (protein or DNA). A higher similarity cutoff works with a longer word, which yields higher clustering speed. An important property of a clustering method is whether its results change when the order of the input sequences changes. Most methods introduced herein, including the programs in the CD-HIT package, Uclust and DNACLUST, sort sequences by length and process them from long to short. The OTU clustering packages (e.g. CD-HIT-OTU) sort sequences by abundance and process them from high to low. So the order of the input sequences does not change the output clusters, except when input sequences of the same length (or abundance) are in a different order. Reads in most Illumina data sets have identical length, so the clustering results for Illumina reads depend on the order of the input sequences.</p>
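<p>The order dependence described above can be reproduced with a toy greedy incremental clustering. This sketch is not the CD-HIT algorithm: it uses a deliberately simple ungapped identity measure and no word filtering, and the names <italic>identity</italic>, <italic>greedy_cluster</italic> and the reads are hypothetical. It only illustrates that sorting by length leaves the relative order of equal-length reads to the input.</p>
<preformat><![CDATA[
```python
def identity(a, b):
    """Ungapped identity over the shorter sequence, aligned at the first base."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n if n else 0.0


def greedy_cluster(seqs, cutoff):
    """Greedy incremental clustering of (name, sequence) pairs.

    Sequences are processed from long to short; each one joins the first
    representative it matches at `cutoff`, otherwise it becomes a new
    representative. Returns {representative name: [member names]}.
    """
    # sorted() is stable, so equal-length sequences keep their input order.
    ordered = sorted(seqs, key=lambda item: len(item[1]), reverse=True)
    reps = []
    clusters = {}
    for name, seq in ordered:
        for rep_name, rep_seq in reps:
            if identity(seq, rep_seq) >= cutoff:
                clusters[rep_name].append(name)
                break
        else:
            reps.append((name, seq))
            clusters[name] = [name]
    return clusters


# Three hypothetical reads of identical length.
read_a = ("readA", "AAAAAAAAAA")
read_b = ("readB", "AAAAAAAACC")  # 80% identical to readA and to readC
read_c = ("readC", "AAAAAACCCC")  # 60% identical to readA

first = greedy_cluster([read_a, read_b, read_c], cutoff=0.75)
second = greedy_cluster([read_b, read_a, read_c], cutoff=0.75)
print(first)
print(second)
```
]]></preformat>
<p>With the same three reads, the first input order gives two clusters and the second gives one, because a different read becomes the representative.</p>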
<p>
<boxed-text id="bbs035-BOX1" position="float">
<caption>
<title>Key Points</title>
</caption>
<p>
<list list-type="bullet">
<list-item>
<p>Sequence clustering is an effective method to answer and address many fundamental questions and challenges in metagenomics. The applications include but are not limited to finding duplicates, diversity analyses, filtering sequence errors, database searches and finding protein families.</p>
</list-item>
<list-item>
<p>Ultra-fast clustering methods, such as CD-HIT, use less accurate algorithms than some sophisticated algorithms that rely on all-against-all similarities. However, when used intelligently (e.g. multi-step clustering using parameters that fit a sequencing error model), the ultra-fast methods can produce results comparable to those of the sophisticated methods and can still be orders of magnitude faster.</p>
</list-item>
<list-item>
<p>Artificial duplicates should be removed for correct abundance calculation. However, attention should be paid to high-abundance viral and transcriptomic samples, where natural duplicates may be more abundant than artificial ones.</p>
</list-item>
<list-item>
<p>Using NR data sets saves significant database search time in metagenome annotation. However, conservative clustering parameters need to be used to ensure the clusters are homogeneous according to the annotation goal.</p>
</list-item>
<list-item>
<p>Clustering analysis is effective in finding novel gene families that might be overlooked using only reference-based annotation. Multi-step hierarchical clustering using ultra-fast methods can rapidly produce protein families from very large data sets.</p>
</list-item>
</list>
</p>
</boxed-text>
</p>
</sec>
<sec>
<title>FUNDING</title>
<p>
<funding-source>National Institutes of Health</funding-source>
[
<award-id>R01RR025030</award-id>
and
<award-id>R01HG005978</award-id>
to W.L.] and the
<funding-source>Gordon and Betty Moore Foundation</funding-source>
[CAMERA project].</p>
</sec>
</body>
<back>
<bio id="d34e36">
<p>
<bold>Weizhong Li</bold>
is an Associate Research Scientist at the Center for Research in Biological Systems at University of California San Diego. Dr. Li has a background in computational biology. His research focuses on developing computational methods for sequence, genomic and metagenomic data analysis.</p>
</bio>
<bio id="d34e47">
<p>
<bold>Limin Fu</bold>
is a Postdoctoral Associate at the Center for Research in Biological Systems at University of California San Diego. Dr. Fu’s background is mathematics. His research focuses on bioinformatics algorithm development.</p>
</bio>
<bio id="d34e58">
<p>
<bold>Beifang Niu</bold>
is a Postdoctoral Associate at the Center for Research in Biological Systems at University of California San Diego. Dr. Niu was trained as a computer scientist. His research focuses on next-generation sequence analysis.</p>
</bio>
<bio id="d34e69">
<p>
<bold>Sitao Wu</bold>
is a Staff Scientist at the Center for Research in Biological Systems at University of California San Diego. Dr. Wu has a background in electrical engineering. His research interests include protein structure prediction and metagenomics.</p>
</bio>
<bio id="d34e80">
<p>
<bold>John Wooley</bold>
is a Professor of Pharmacology and Associate Vice Chancellor, Research at the University of California San Diego, as well as a member of the Center for Research in Biological Systems and the California Institute of Telecommunications and Information Technology. Dr. Wooley’s background is biophysics and his current research interests include structural genomics and metagenomics.</p>
</bio>
<ref-list>
<title>References</title>
<ref id="bbs035-B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Handelsman</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Metagenomics: application of genomics to uncultured microorganisms</article-title>
<source>Microbiol Mol Biol Rev</source>
<year>2004</year>
<volume>68</volume>
<fpage>669</fpage>
<lpage>85</lpage>
<pub-id pub-id-type="pmid">15590779</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wooley</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Friedberg</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>A primer on metagenomics</article-title>
<source>PLoS Comput Biol</source>
<year>2010</year>
<volume>6</volume>
<fpage>e1000667</fpage>
<pub-id pub-id-type="pmid">20195499</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Venter</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Remington</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Environmental genome shotgun sequencing of the Sargasso Sea</article-title>
<source>Science</source>
<year>2004</year>
<volume>304</volume>
<fpage>66</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="pmid">15001713</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gill</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Deboy</surname>
<given-names>RT</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Metagenomic analysis of the human distal gut microbiome</article-title>
<source>Science</source>
<year>2006</year>
<volume>312</volume>
<fpage>1355</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">16741115</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Kobayashi</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Comparative metagenomics of microbial communities</article-title>
<source>Science</source>
<year>2005</year>
<volume>308</volume>
<fpage>554</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">15845853</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mardis</surname>
<given-names>ER</given-names>
</name>
</person-group>
<article-title>A decade's perspective on DNA sequencing technology</article-title>
<source>Nature</source>
<year>2011</year>
<volume>470</volume>
<fpage>198</fpage>
<lpage>203</lpage>
<pub-id pub-id-type="pmid">21307932</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dinsdale</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Functional metagenomic profiling of nine biomes</article-title>
<source>Nature</source>
<year>2008</year>
<volume>452</volume>
<fpage>629</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="pmid">18337718</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hess</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sczyrba</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Egan</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Metagenomic discovery of biomass-degrading genes and genomes from cow rumen</article-title>
<source>Science</source>
<year>2011</year>
<volume>331</volume>
<fpage>463</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">21273488</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A human gut microbial gene catalogue established by metagenomic sequencing</article-title>
<source>Nature</source>
<year>2010</year>
<volume>464</volume>
<fpage>59</fpage>
<lpage>65</lpage>
<pub-id pub-id-type="pmid">20203603</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peterson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Garges</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Giovanni</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The NIH Human Microbiome Project</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<fpage>2317</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="pmid">19819907</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chitsaz</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Yee-Greenbaum</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Efficient de novo assembly of single-cell bacterial genomes from short-read data sets</article-title>
<source>Nat Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<fpage>915</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="pmid">21926975</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gomez-Alvarez</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Teal</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>TM</given-names>
</name>
</person-group>
<article-title>Systematic artifacts in metagenomes from complex microbial communities</article-title>
<source>ISME J</source>
<year>2009</year>
<volume>3</volume>
<fpage>1314</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">19587772</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Niu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Artificial and natural duplicates in pyrosequencing reads of metagenomic data</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<fpage>187</fpage>
<pub-id pub-id-type="pmid">20388221</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schloss</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Westcott</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Ryabin</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities</article-title>
<source>Appl Environ Microbiol</source>
<year>2009</year>
<volume>75</volume>
<fpage>7537</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="pmid">19801464</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Wooley</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Probing metagenomics by rapid cluster analysis of very large datasets</article-title>
<source>PLoS ONE</source>
<year>2008</year>
<volume>3</volume>
<fpage>e3375</fpage>
<pub-id pub-id-type="pmid">18846219</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yooseph</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families</article-title>
<source>PLoS Biol</source>
<year>2007</year>
<volume>5</volume>
<fpage>e16</fpage>
<pub-id pub-id-type="pmid">17355171</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gilbert</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Field</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities</article-title>
<source>PLoS ONE</source>
<year>2008</year>
<volume>3</volume>
<fpage>e3042</fpage>
<pub-id pub-id-type="pmid">18725995</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yona</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>ProtoMap: automatic classification of protein sequences and hierarchy of protein families</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<fpage>49</fpage>
<lpage>55</lpage>
<pub-id pub-id-type="pmid">10592179</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sasson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Vaaknin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Fleischer</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ProtoNet: hierarchical classification of the protein space</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>348</fpage>
<lpage>52</lpage>
<pub-id pub-id-type="pmid">12520020</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Holm</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>RSDB: representative protein sequence databases have high information content</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>458</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="pmid">10871268</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>GeneRAGE: a robust algorithm for sequence clustering and domain detection</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>451</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">10871267</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>An efficient algorithm for large-scale detection of protein families</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<fpage>1575</fpage>
<lpage>84</lpage>
<pub-id pub-id-type="pmid">11917018</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pipenbacher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Schliep</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Schneckener</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ProClust: improved clustering of protein sequences with an extended graph-based approach</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>Suppl. 2</issue>
<fpage>S182</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="pmid">12386002</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mika</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>UniqueProt: creating representative protein sequence sets</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>3789</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="pmid">12824419</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Stoeckert</surname>
<given-names>CJ</given-names>
<suffix>Jr</suffix>
</name>
<name>
<surname>Roos</surname>
<given-names>DS</given-names>
</name>
</person-group>
<article-title>OrthoMCL: identification of ortholog groups for eukaryotic genomes</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<fpage>2178</fpage>
<lpage>89</lpage>
<pub-id pub-id-type="pmid">12952885</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Loewenstein</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Portugaly</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fromer</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>i41</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">18586742</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schaffer</surname>
<given-names>AA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<year>1997</year>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Niu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>CD-HIT Suite: a web server for clustering and comparing biological sequences</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>680</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="pmid">20053844</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>WZ</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>1658</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">16731699</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>WZ</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Clustering of highly homologous sequences to reduce the size of large protein databases</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>282</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="pmid">11294794</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>WZ</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Tolerating some redundancy significantly speeds up clustering of large protein databases</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>77</fpage>
<lpage>82</lpage>
<pub-id pub-id-type="pmid">11836214</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boguski</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Schuler</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>ESTablishing a human transcript map</article-title>
<source>Nat Genet</source>
<year>1995</year>
<volume>10</volume>
<fpage>369</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="pmid">7670480</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>651</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="pmid">12651724</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Davison</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hide</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>d2_cluster: a validated method for clustering EST and full-length cDNA sequences</article-title>
<source>Genome Res</source>
<year>1999</year>
<volume>9</volume>
<fpage>1135</fpage>
<lpage>42</lpage>
<pub-id pub-id-type="pmid">10568753</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Malde</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Coward</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Jonassen</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Fast sequence clustering using a suffix array algorithm</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1221</fpage>
<pub-id pub-id-type="pmid">12835265</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ptitsyn</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hide</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>CLU: a new algorithm for EST clustering</article-title>
<source>BMC Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<issue>Suppl. 2</issue>
<fpage>S3</fpage>
<pub-id pub-id-type="pmid">16026600</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hazelhurst</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lipták</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>KABOOM! A new suffix-array based algorithm for clustering expression data</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>3348</fpage>
<lpage>55</lpage>
<pub-id pub-id-type="pmid">21984769</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Suzek</surname>
<given-names>BE</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>HZ</given-names>
</name>
<name>
<surname>McGarvey</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>UniRef: comprehensive and non-redundant UniProt reference clusters</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>1282</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="pmid">17379688</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Search and clustering orders of magnitude faster than BLAST</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>2460</fpage>
<lpage>1</lpage>
<pub-id pub-id-type="pmid">20709691</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ghodsi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>DNACLUST: accurate and efficient clustering of phylogenetic marker genes</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>271</fpage>
<pub-id pub-id-type="pmid">21718538</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kaloshian</surname>
<given-names>I</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SEED: efficient clustering of next-generation sequences</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>2502</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">21810899</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource</article-title>
<source>Nucleic Acids Res</source>
<year>2011</year>
<volume>39</volume>
<fpage>D546</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="pmid">21045053</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>WebMGA: a customizable web server for fast metagenomic sequence analysis</article-title>
<source>BMC Genomics</source>
<year>2011</year>
<volume>12</volume>
<fpage>444</fpage>
<pub-id pub-id-type="pmid">21899761</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Costello</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Lauber</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Hamady</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Bacterial community variation in human body habitats across space and time</article-title>
<source>Science</source>
<year>2009</year>
<volume>326</volume>
<fpage>1694</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">19892944</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B45">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Paarmann</surname>
<given-names>D</given-names>
</name>
<name>
<surname>D'Souza</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>386</fpage>
<pub-id pub-id-type="pmid">18803844</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B46">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Markowitz</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Szeto</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>IMG/M: a data management and analysis system for metagenomes</article-title>
<source>Nucleic Acids Res</source>
<year>2008</year>
<volume>36</volume>
<fpage>D534</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="pmid">17932063</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific</article-title>
<source>PLoS Biol</source>
<year>2007</year>
<volume>5</volume>
<fpage>e77</fpage>
<pub-id pub-id-type="pmid">17355176</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Turnbaugh</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Hamady</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Yatsunenko</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A core gut microbiome in obese and lean twins</article-title>
<source>Nature</source>
<year>2009</year>
<volume>457</volume>
<fpage>480</fpage>
<lpage>4</lpage>
</element-citation>
</ref>
<ref id="bbs035-B49">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schloss</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Handelsman</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness</article-title>
<source>Appl Environ Microbiol</source>
<year>2005</year>
<volume>71</volume>
<fpage>1501</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">15746353</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B50">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Huse</surname>
<given-names>SM</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis</article-title>
<source>Brief Bioinform</source>
<year>2012</year>
<volume>13</volume>
<fpage>107</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="pmid">21525143</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B51">
<label>51</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
<volume>37</volume>
<fpage>e76</fpage>
<pub-id pub-id-type="pmid">19417062</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B52">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huse</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Welch</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Morrison</surname>
<given-names>HG</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ironing out the wrinkles in the rare biosphere through improved OTU clustering</article-title>
<source>Environ Microbiol</source>
<year>2010</year>
<volume>12</volume>
<fpage>1889</fpage>
<lpage>98</lpage>
<pub-id pub-id-type="pmid">20236171</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B53">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kunin</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Engelbrektson</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ochman</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates</article-title>
<source>Environ Microbiol</source>
<year>2010</year>
<volume>12</volume>
<fpage>118</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="pmid">19725865</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B54">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quince</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lanzen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Curtis</surname>
<given-names>TP</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Accurate determination of microbial diversity from 454 pyrosequencing data</article-title>
<source>Nat Methods</source>
<year>2009</year>
<volume>6</volume>
<fpage>639</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="pmid">19668203</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B55">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reeder</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions</article-title>
<source>Nat Methods</source>
<year>2010</year>
<volume>7</volume>
<fpage>668</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">20805793</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B56">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quince</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lanzen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Davenport</surname>
<given-names>RJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Removing noise from pyrosequenced amplicons</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>38</fpage>
<pub-id pub-id-type="pmid">21276213</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B57">
<label>57</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Caporaso</surname>
<given-names>JG</given-names>
</name>
<name>
<surname>Kuczynski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Stombaugh</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>QIIME allows analysis of high-throughput community sequencing data</article-title>
<source>Nat Methods</source>
<year>2010</year>
<volume>7</volume>
<fpage>335</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">20383131</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B58">
<label>58</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qu</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hashimoto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morishita</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<fpage>1309</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="pmid">19439514</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B59">
<label>59</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Palmer</surname>
<given-names>LE</given-names>
</name>
<name>
<surname>Bolanos</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>EDAR: an efficient error detection and removal algorithm for next generation sequencing data</article-title>
<source>J Comput Biol</source>
<year>2010</year>
<volume>17</volume>
<fpage>1549</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="pmid">20973743</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B60">
<label>60</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medvedev</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Scott</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Kakaradov</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Error correction of high-throughput sequencing datasets with non-uniform coverage</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>i137</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="pmid">21685062</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B61">
<label>61</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kao</surname>
<given-names>WC</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>YS</given-names>
</name>
</person-group>
<article-title>ECHO: a reference-free short-read error correction algorithm</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<fpage>1181</fpage>
<lpage>92</lpage>
<pub-id pub-id-type="pmid">21482625</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B62">
<label>62</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SOAP2: an improved ultrafast tool for short read alignment</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1966</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">19497933</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B63">
<label>63</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B64">
<label>64</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>De novo assembly of human genomes with massively parallel short read sequencing</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="pmid">20019144</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B65">
<label>65</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1754</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="pmid">19451168</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B66">
<label>66</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>BLAT–the BLAST-like alignment tool</article-title>
<source>Genome Res</source>
<year>2002</year>
<volume>12</volume>
<fpage>656</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="pmid">11932250</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B67">
<label>67</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Niu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<fpage>1704</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="pmid">21505035</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B68">
<label>68</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>RAPSearch: a fast protein similarity search tool for short reads</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>159</fpage>
<pub-id pub-id-type="pmid">21575167</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B69">
<label>69</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>WZ</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Sequence clustering strategies improve remote homology recognitions while reducing search times</article-title>
<source>Protein Eng</source>
<year>2002</year>
<volume>15</volume>
<fpage>643</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">12364578</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B70">
<label>70</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruesse</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Quast</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Knittel</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<fpage>7188</fpage>
<lpage>96</lpage>
<pub-id pub-id-type="pmid">17947321</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B71">
<label>71</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Cardenas</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Ribosomal Database Project: improved alignments and new tools for rRNA analysis</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
<volume>37</volume>
<fpage>D141</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="pmid">19004872</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B72">
<label>72</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeSantis</surname>
<given-names>TZ</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Larsen</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB</article-title>
<source>Appl Environ Microbiol</source>
<year>2006</year>
<volume>72</volume>
<fpage>5069</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="pmid">16820507</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B73">
<label>73</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ogata</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Goto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sato</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>KEGG: Kyoto Encyclopedia of Genes and Genomes</article-title>
<source>Nucleic Acids Res</source>
<year>1999</year>
<volume>27</volume>
<fpage>29</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="pmid">9847135</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B74">
<label>74</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rychlewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>WZ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Comparison of sequence profiles. Strategies for structural predictions using sequence information</article-title>
<source>Protein Sci</source>
<year>2000</year>
<volume>9</volume>
<fpage>232</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="pmid">10716175</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B75">
<label>75</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yooseph</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>182</fpage>
<pub-id pub-id-type="pmid">18402669</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B76">
<label>76</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Analysis and comparison of very large metagenomes with fast clustering and functional annotation</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>359</fpage>
<pub-id pub-id-type="pmid">19863816</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B77">
<label>77</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Noguchi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Takagi</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>MetaGene: prokaryotic gene finding from environmental genome shotgun sequences</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>5623</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="pmid">17028096</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B78">
<label>78</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Mistry</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Pfam protein families database</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
<volume>38</volume>
<fpage>D211</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">19920124</pub-id>
</element-citation>
</ref>
<ref id="bbs035-B79">
<label>79</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eddy</surname>
<given-names>SR</given-names>
</name>
</person-group>
<article-title>A new generation of homology search tools based on probabilistic inference</article-title>
<source>Genome Inform</source>
<year>2009</year>
<volume>23</volume>
<fpage>205</fpage>
<lpage>11</lpage>
<pub-id pub-id-type="pmid">20180275</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000507  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000507  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024