Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

kClust: fast and sensitive clustering of large protein sequence databases

Identifieur interne : 000934 ( Pmc/Corpus ); précédent : 000933; suivant : 000935

kClust: fast and sensitive clustering of large protein sequence databases

Auteurs : Maria Hauser ; Christian E. Mayer ; Johannes Söding

Source :

RBID : PMC:3843501

Abstract

Background

Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results

Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions

kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.


Url:
DOI: 10.1186/1471-2105-14-248
PubMed: 23945046
PubMed Central: 3843501

Links to Exploration step

PMC:3843501

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">kClust: fast and sensitive clustering of large protein sequence databases</title>
<author>
<name sortKey="Hauser, Maria" sort="Hauser, Maria" uniqKey="Hauser M" first="Maria" last="Hauser">Maria Hauser</name>
<affiliation>
<nlm:aff id="I1">Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mayer, Christian E" sort="Mayer, Christian E" uniqKey="Mayer C" first="Christian E" last="Mayer">Christian E. Mayer</name>
<affiliation>
<nlm:aff id="I2">Department for Protein Evolution, Max Planck Institute for Developmental Biology, Spemannstr. 35, Tübingen 72076, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Present address: D-BSSE, ETH Zuerich, Mattenstr. 26, Basel 4058, Switzerland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Soding, Johannes" sort="Soding, Johannes" uniqKey="Soding J" first="Johannes" last="Söding">Johannes Söding</name>
<affiliation>
<nlm:aff id="I1">Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23945046</idno>
<idno type="pmc">3843501</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843501</idno>
<idno type="RBID">PMC:3843501</idno>
<idno type="doi">10.1186/1471-2105-14-248</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000934</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000934</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">kClust: fast and sensitive clustering of large protein sequence databases</title>
<author>
<name sortKey="Hauser, Maria" sort="Hauser, Maria" uniqKey="Hauser M" first="Maria" last="Hauser">Maria Hauser</name>
<affiliation>
<nlm:aff id="I1">Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mayer, Christian E" sort="Mayer, Christian E" uniqKey="Mayer C" first="Christian E" last="Mayer">Christian E. Mayer</name>
<affiliation>
<nlm:aff id="I2">Department for Protein Evolution, Max Planck Institute for Developmental Biology, Spemannstr. 35, Tübingen 72076, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Present address: D-BSSE, ETH Zuerich, Mattenstr. 26, Basel 4058, Switzerland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Soding, Johannes" sort="Soding, Johannes" uniqKey="Soding J" first="Johannes" last="Söding">Johannes Söding</name>
<affiliation>
<nlm:aff id="I1">Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.</p>
</sec>
<sec>
<title>Results</title>
<p>Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at
<ext-link ext-link-type="uri" xlink:href="http://toolkit.lmb.uni-muenchen.de/pub/kClust/">http://toolkit.lmb.uni-muenchen.de/pub/kClust/</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chubb, D" uniqKey="Chubb D">D Chubb</name>
</author>
<author>
<name sortKey="Jefferys, Br" uniqKey="Jefferys B">BR Jefferys</name>
</author>
<author>
<name sortKey="Sternberg, Mje" uniqKey="Sternberg M">MJE Sternberg</name>
</author>
<author>
<name sortKey="Kelley, La" uniqKey="Kelley L">LA Kelley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Park, J" uniqKey="Park J">J Park</name>
</author>
<author>
<name sortKey="Holm, L" uniqKey="Holm L">L Holm</name>
</author>
<author>
<name sortKey="Heger, A" uniqKey="Heger A">A Heger</name>
</author>
<author>
<name sortKey="Chothia, C" uniqKey="Chothia C">C Chothia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Suzek, B" uniqKey="Suzek B">B Suzek</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Mcgarvey, P" uniqKey="Mcgarvey P">P McGarvey</name>
</author>
<author>
<name sortKey="Mazumder, R" uniqKey="Mazumder R">R Mazumder</name>
</author>
<author>
<name sortKey="Wu, C" uniqKey="Wu C">C Wu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rusch, Db" uniqKey="Rusch D">DB Rusch</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
<author>
<name sortKey="Heidelberg, Kb" uniqKey="Heidelberg K">KB Heidelberg</name>
</author>
<author>
<name sortKey="Williamson, S" uniqKey="Williamson S">S Williamson</name>
</author>
<author>
<name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
<author>
<name sortKey="Wu, D" uniqKey="Wu D">D Wu</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
<author>
<name sortKey="Hoffman, Jm" uniqKey="Hoffman J">JM Hoffman</name>
</author>
<author>
<name sortKey="Remington, K" uniqKey="Remington K">K Remington</name>
</author>
<author>
<name sortKey="Beeson, K" uniqKey="Beeson K">K Beeson</name>
</author>
<author>
<name sortKey="Tran, B" uniqKey="Tran B">B Tran</name>
</author>
<author>
<name sortKey="Baden Tillson, H" uniqKey="Baden Tillson H">H Baden-Tillson</name>
</author>
<author>
<name sortKey="Stewart, C" uniqKey="Stewart C">C Stewart</name>
</author>
<author>
<name sortKey="Thorpe, J" uniqKey="Thorpe J">J Thorpe</name>
</author>
<author>
<name sortKey="Freeman, J" uniqKey="Freeman J">J Freeman</name>
</author>
<author>
<name sortKey="Andrews Pfannkoch, C" uniqKey="Andrews Pfannkoch C">C Andrews-Pfannkoch</name>
</author>
<author>
<name sortKey="Venter, Je" uniqKey="Venter J">JE Venter</name>
</author>
<author>
<name sortKey="Li, K" uniqKey="Li K">K Li</name>
</author>
<author>
<name sortKey="Kravitz, S" uniqKey="Kravitz S">S Kravitz</name>
</author>
<author>
<name sortKey="Heidelberg, Jf" uniqKey="Heidelberg J">JF Heidelberg</name>
</author>
<author>
<name sortKey="Utterback, T" uniqKey="Utterback T">T Utterback</name>
</author>
<author>
<name sortKey="Rogers, Y" uniqKey="Rogers Y">Y Rogers</name>
</author>
<author>
<name sortKey="Falc N, Li" uniqKey="Falc N L">LI Falcón</name>
</author>
<author>
<name sortKey="Souza, V" uniqKey="Souza V">V Souza</name>
</author>
<author>
<name sortKey="Bonilla Rosso, G" uniqKey="Bonilla Rosso G">G Bonilla-Rosso</name>
</author>
<author>
<name sortKey="Eguiarte, Le" uniqKey="Eguiarte L">LE Eguiarte</name>
</author>
<author>
<name sortKey="Karl, Dm" uniqKey="Karl D">DM Karl</name>
</author>
<author>
<name sortKey="Sathyendranath, S" uniqKey="Sathyendranath S">S Sathyendranath</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qin, J" uniqKey="Qin J">J Qin</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
<author>
<name sortKey="Arumugam, M" uniqKey="Arumugam M">M Arumugam</name>
</author>
<author>
<name sortKey="Burgdorf, Ks" uniqKey="Burgdorf K">KS Burgdorf</name>
</author>
<author>
<name sortKey="Manichanh, C" uniqKey="Manichanh C">C Manichanh</name>
</author>
<author>
<name sortKey="Nielsen, T" uniqKey="Nielsen T">T Nielsen</name>
</author>
<author>
<name sortKey="Pons, N" uniqKey="Pons N">N Pons</name>
</author>
<author>
<name sortKey="Levenez, F" uniqKey="Levenez F">F Levenez</name>
</author>
<author>
<name sortKey="Yamada, T" uniqKey="Yamada T">T Yamada</name>
</author>
<author>
<name sortKey="Mende, Dr" uniqKey="Mende D">DR Mende</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Xu, J" uniqKey="Xu J">J Xu</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
<author>
<name sortKey="Li, D" uniqKey="Li D">D Li</name>
</author>
<author>
<name sortKey="Cao, J" uniqKey="Cao J">J Cao</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Liang, H" uniqKey="Liang H">H Liang</name>
</author>
<author>
<name sortKey="Zheng, H" uniqKey="Zheng H">H Zheng</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Tap, J" uniqKey="Tap J">J Tap</name>
</author>
<author>
<name sortKey="Lepage, P" uniqKey="Lepage P">P Lepage</name>
</author>
<author>
<name sortKey="Bertalan, M" uniqKey="Bertalan M">M Bertalan</name>
</author>
<author>
<name sortKey="Batto, Jm" uniqKey="Batto J">JM Batto</name>
</author>
<author>
<name sortKey="Hansen, T" uniqKey="Hansen T">T Hansen</name>
</author>
<author>
<name sortKey="Le Paslier, D" uniqKey="Le Paslier D">D Le Paslier</name>
</author>
<author>
<name sortKey="Linneberg, A" uniqKey="Linneberg A">A Linneberg</name>
</author>
<author>
<name sortKey="Nielsen, Hb" uniqKey="Nielsen H">HB Nielsen</name>
</author>
<author>
<name sortKey="Pelletier, E" uniqKey="Pelletier E">E Pelletier</name>
</author>
<author>
<name sortKey="Renault, P" uniqKey="Renault P">P Renault</name>
</author>
<author>
<name sortKey="Et, Al" uniqKey="Et A">al et</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Remmert, M" uniqKey="Remmert M">M Remmert</name>
</author>
<author>
<name sortKey="Biegert, A" uniqKey="Biegert A">A Biegert</name>
</author>
<author>
<name sortKey="Hauser, A" uniqKey="Hauser A">A Hauser</name>
</author>
<author>
<name sortKey="Soding, J" uniqKey="Soding J">J Söding</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, W" uniqKey="Pearson W">W Pearson</name>
</author>
<author>
<name sortKey="Lipman, D" uniqKey="Lipman D">D Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yona, G" uniqKey="Yona G">G Yona</name>
</author>
<author>
<name sortKey="Linial, N" uniqKey="Linial N">N Linial</name>
</author>
<author>
<name sortKey="Linial, M" uniqKey="Linial M">M Linial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krause, A" uniqKey="Krause A">A Krause</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miele, V" uniqKey="Miele V">V Miele</name>
</author>
<author>
<name sortKey="Penel, S" uniqKey="Penel S">S Penel</name>
</author>
<author>
<name sortKey="Duret, L" uniqKey="Duret L">L Duret</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rappoport, N" uniqKey="Rappoport N">N Rappoport</name>
</author>
<author>
<name sortKey="Karsenty, S" uniqKey="Karsenty S">S Karsenty</name>
</author>
<author>
<name sortKey="Stern, A" uniqKey="Stern A">A Stern</name>
</author>
<author>
<name sortKey="Linial, N" uniqKey="Linial N">N Linial</name>
</author>
<author>
<name sortKey="Linial, M" uniqKey="Linial M">M Linial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
<author>
<name sortKey="Storm, Ce" uniqKey="Storm C">CE Storm</name>
</author>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tatusov, Rl" uniqKey="Tatusov R">RL Tatusov</name>
</author>
<author>
<name sortKey="Fedorova, Nd" uniqKey="Fedorova N">ND Fedorova</name>
</author>
<author>
<name sortKey="Jackson, Jd" uniqKey="Jackson J">JD Jackson</name>
</author>
<author>
<name sortKey="Jacobs, Ar" uniqKey="Jacobs A">AR Jacobs</name>
</author>
<author>
<name sortKey="Kiryutin, B" uniqKey="Kiryutin B">B Kiryutin</name>
</author>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
<author>
<name sortKey="Krylov, Dm" uniqKey="Krylov D">DM Krylov</name>
</author>
<author>
<name sortKey="Mazumder, R" uniqKey="Mazumder R">R Mazumder</name>
</author>
<author>
<name sortKey="Mekhedov, Sl" uniqKey="Mekhedov S">SL Mekhedov</name>
</author>
<author>
<name sortKey="Nikolskaya, An" uniqKey="Nikolskaya A">AN Nikolskaya</name>
</author>
<author>
<name sortKey="Rao, Bs" uniqKey="Rao B">BS Rao</name>
</author>
<author>
<name sortKey="Smirnov, S" uniqKey="Smirnov S">S Smirnov</name>
</author>
<author>
<name sortKey="Sverdlov, Av" uniqKey="Sverdlov A">AV Sverdlov</name>
</author>
<author>
<name sortKey="Vasudevan, S" uniqKey="Vasudevan S">S Vasudevan</name>
</author>
<author>
<name sortKey="Wolf, Yi" uniqKey="Wolf Y">YI Wolf</name>
</author>
<author>
<name sortKey="Yin, Jj" uniqKey="Yin J">JJ Yin</name>
</author>
<author>
<name sortKey="Natale, Da" uniqKey="Natale D">DA Natale</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Stoeckert, Cj" uniqKey="Stoeckert C">CJ Stoeckert</name>
</author>
<author>
<name sortKey="Roos, Ds" uniqKey="Roos D">DS Roos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alexeyenko, A" uniqKey="Alexeyenko A">A Alexeyenko</name>
</author>
<author>
<name sortKey="Tamas, I" uniqKey="Tamas I">I Tamas</name>
</author>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G Liu</name>
</author>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, Tw" uniqKey="Chen T">TW Chen</name>
</author>
<author>
<name sortKey="Wu, Th" uniqKey="Wu T">TH Wu</name>
</author>
<author>
<name sortKey="Ng, Wv" uniqKey="Ng W">WV Ng</name>
</author>
<author>
<name sortKey="Lin, Wc" uniqKey="Lin W">WC Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Powell, S" uniqKey="Powell S">S Powell</name>
</author>
<author>
<name sortKey="Szklarczyk, D" uniqKey="Szklarczyk D">D Szklarczyk</name>
</author>
<author>
<name sortKey="Trachana, K" uniqKey="Trachana K">K Trachana</name>
</author>
<author>
<name sortKey="Roth, A" uniqKey="Roth A">A Roth</name>
</author>
<author>
<name sortKey="Kuhn, M" uniqKey="Kuhn M">M Kuhn</name>
</author>
<author>
<name sortKey="Muller, J" uniqKey="Muller J">J Muller</name>
</author>
<author>
<name sortKey="Arnold, R" uniqKey="Arnold R">R Arnold</name>
</author>
<author>
<name sortKey="Rattei, T" uniqKey="Rattei T">T Rattei</name>
</author>
<author>
<name sortKey="Letunic, I" uniqKey="Letunic I">I Letunic</name>
</author>
<author>
<name sortKey="Doerks, T" uniqKey="Doerks T">T Doerks</name>
</author>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
<author>
<name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rattei, T" uniqKey="Rattei T">T Rattei</name>
</author>
<author>
<name sortKey="Tischler, P" uniqKey="Tischler P">P Tischler</name>
</author>
<author>
<name sortKey="Gotz, S" uniqKey="Gotz S">S Götz</name>
</author>
<author>
<name sortKey="Jehl, Ma" uniqKey="Jehl M">MA Jehl</name>
</author>
<author>
<name sortKey="Hoser, J" uniqKey="Hoser J">J Hoser</name>
</author>
<author>
<name sortKey="Arnold, R" uniqKey="Arnold R">R Arnold</name>
</author>
<author>
<name sortKey="Conesa, A" uniqKey="Conesa A">A Conesa</name>
</author>
<author>
<name sortKey="Mewes, Hw" uniqKey="Mewes H">HW Mewes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Jaroszewski, L" uniqKey="Jaroszewski L">L Jaroszewski</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fu, L" uniqKey="Fu L">L Fu</name>
</author>
<author>
<name sortKey="Niu, B" uniqKey="Niu B">B Niu</name>
</author>
<author>
<name sortKey="Zhu, Z" uniqKey="Zhu Z">Z Zhu</name>
</author>
<author>
<name sortKey="Wu, S" uniqKey="Wu S">S Wu</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hobohm, U" uniqKey="Hobohm U">U Hobohm</name>
</author>
<author>
<name sortKey="Scharf, M" uniqKey="Scharf M">M Scharf</name>
</author>
<author>
<name sortKey="Schneider, R" uniqKey="Schneider R">R Schneider</name>
</author>
<author>
<name sortKey="Sander, C" uniqKey="Sander C">C Sander</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Tromp, J" uniqKey="Tromp J">J Tromp</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mayer, Ce" uniqKey="Mayer C">CE Mayer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Przybylski, D" uniqKey="Przybylski D">D Przybylski</name>
</author>
<author>
<name sortKey="Rost, B" uniqKey="Rost B">B Rost</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Finn, Rd" uniqKey="Finn R">RD Finn</name>
</author>
<author>
<name sortKey="Mistry, J" uniqKey="Mistry J">J Mistry</name>
</author>
<author>
<name sortKey="Tate, J" uniqKey="Tate J">J Tate</name>
</author>
<author>
<name sortKey="Coggill, P" uniqKey="Coggill P">P Coggill</name>
</author>
<author>
<name sortKey="Heger, A" uniqKey="Heger A">A Heger</name>
</author>
<author>
<name sortKey="Pollington, Ja" uniqKey="Pollington J">JA Pollington</name>
</author>
<author>
<name sortKey="Gavin, Ol" uniqKey="Gavin O">OL Gavin</name>
</author>
<author>
<name sortKey="Gunasekaran, P" uniqKey="Gunasekaran P">P Gunasekaran</name>
</author>
<author>
<name sortKey="Ceric, G" uniqKey="Ceric G">G Ceric</name>
</author>
<author>
<name sortKey="Forslund, K" uniqKey="Forslund K">K Forslund</name>
</author>
<author>
<name sortKey="Holm, L" uniqKey="Holm L">L Holm</name>
</author>
<author>
<name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
<author>
<name sortKey="Eddy, Sr" uniqKey="Eddy S">SR Eddy</name>
</author>
<author>
<name sortKey="Bateman, A" uniqKey="Bateman A">A Bateman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lo Conte, L" uniqKey="Lo Conte L">L Lo Conte</name>
</author>
<author>
<name sortKey="Ailey, B" uniqKey="Ailey B">B Ailey</name>
</author>
<author>
<name sortKey="Hubbard, Tj" uniqKey="Hubbard T">TJ Hubbard</name>
</author>
<author>
<name sortKey="Brenner, Se" uniqKey="Brenner S">SE Brenner</name>
</author>
<author>
<name sortKey="Murzin, Ag" uniqKey="Murzin A">AG Murzin</name>
</author>
<author>
<name sortKey="Chothia, C" uniqKey="Chothia C">C Chothia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Apweiler, R" uniqKey="Apweiler R">R Apweiler</name>
</author>
<author>
<name sortKey="Bairoch, A" uniqKey="Bairoch A">A Bairoch</name>
</author>
<author>
<name sortKey="Wu, Ch" uniqKey="Wu C">CH Wu</name>
</author>
<author>
<name sortKey="Barker, Wc" uniqKey="Barker W">WC Barker</name>
</author>
<author>
<name sortKey="Boeckmann, B" uniqKey="Boeckmann B">B Boeckmann</name>
</author>
<author>
<name sortKey="Ferro, S" uniqKey="Ferro S">S Ferro</name>
</author>
<author>
<name sortKey="Gasteiger, E" uniqKey="Gasteiger E">E Gasteiger</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Lopez, R" uniqKey="Lopez R">R Lopez</name>
</author>
<author>
<name sortKey="Magrane, M" uniqKey="Magrane M">M Magrane</name>
</author>
<author>
<name sortKey="Martin, Mj" uniqKey="Martin M">MJ Martin</name>
</author>
<author>
<name sortKey="Natale, Da" uniqKey="Natale D">DA Natale</name>
</author>
<author>
<name sortKey="O Onovan, C" uniqKey="O Onovan C">C O’Donovan</name>
</author>
<author>
<name sortKey="Redaschi, N" uniqKey="Redaschi N">N Redaschi</name>
</author>
<author>
<name sortKey="Yeh, Ls" uniqKey="Yeh L">LS Yeh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hegyi, H" uniqKey="Hegyi H">H Hegyi</name>
</author>
<author>
<name sortKey="Gerstein, M" uniqKey="Gerstein M">M Gerstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, E" uniqKey="Bao E">E Bao</name>
</author>
<author>
<name sortKey="Jiang, T" uniqKey="Jiang T">T Jiang</name>
</author>
<author>
<name sortKey="Kaloshian, I" uniqKey="Kaloshian I">I Kaloshian</name>
</author>
<author>
<name sortKey="Girke, T" uniqKey="Girke T">T Girke</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="product-review" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23945046</article-id>
<article-id pub-id-type="pmc">3843501</article-id>
<article-id pub-id-type="publisher-id">1471-2105-14-248</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-14-248</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Software</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>kClust: fast and sensitive clustering of large protein sequence databases</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" equal-contrib="yes" id="A1">
<name>
<surname>Hauser</surname>
<given-names>Maria</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>mhauser@genzentrum.lmu.de</email>
</contrib>
<contrib contrib-type="author" equal-contrib="yes" id="A2">
<name>
<surname>Mayer</surname>
<given-names>Christian E</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>christian.mayer@bsse.ethz.ch</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A3">
<name>
<surname>Söding</surname>
<given-names>Johannes</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>soeding@genzentrum.lmu.de</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich 81377, Germany</aff>
<aff id="I2">
<label>2</label>
Department for Protein Evolution, Max Planck Institute for Developmental Biology, Spemannstr. 35, Tübingen 72076, Germany</aff>
<aff id="I3">
<label>3</label>
Present address: D-BSSE, ETH Zuerich, Mattenstr. 26, Basel 4058, Switzerland</aff>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>15</day>
<month>8</month>
<year>2013</year>
</pub-date>
<volume>14</volume>
<fpage>248</fpage>
<lpage>248</lpage>
<history>
<date date-type="received">
<day>24</day>
<month>4</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>8</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2013 Hauser et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2013</copyright-year>
<copyright-holder>Hauser et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/14/248"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.</p>
</sec>
<sec>
<title>Results</title>
<p>Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at
<ext-link ext-link-type="uri" xlink:href="http://toolkit.lmb.uni-muenchen.de/pub/kClust/">http://toolkit.lmb.uni-muenchen.de/pub/kClust/</ext-link>
.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>In recent years, the amount of sequence data has been growing at an accelerating pace. While one would expect the denser sampling of sequence space to lead to better performance of sequence searches, the opposite seems to be true: The increase has led to stagnating or even negative returns, as measured by the ability to detect homologous sequences for structure or function predictions [
<xref ref-type="bibr" rid="B1">1</xref>
]. Removing redundant sequences through clustering can partly alleviate this problem: [
<xref ref-type="bibr" rid="B2">2</xref>
] and [
<xref ref-type="bibr" rid="B3">3</xref>
] showed that sequence searches through clustered databases, which contain one representative sequence per cluster, could improve the sensitivity of homology search methods. Furthermore, clustering reduces search times and can greatly improve the readability of search results [
<xref ref-type="bibr" rid="B4">4</xref>
]. Clustering is also often used in the analysis of metagenomics experiments, where potentially orthologous sequences are clustered together in order to define the inventory of biochemical reactions that are likely to be present in the metagenomic community [
<xref ref-type="bibr" rid="B5">5</xref>
-
<xref ref-type="bibr" rid="B7">7</xref>
].</p>
<p>An important motivation to develop kClust was our need for a database of profile hidden Markov models (HMMs) that contains all UniProt sequences, clustered down to 20%–30% sequence identity (uniprot20). Such a database is required by HHblits, the most sensitive protein sequence search method to date [
<xref ref-type="bibr" rid="B8">8</xref>
]. HHblits calculates a profile HMM from the query sequence and compares this query HMM with the uniprot20 database of HMMs. In subsequent iterations, sequences belonging to significantly similar database HMMs are added to the query multiple sequence alignment (MSA). Thanks to the additional information from homologous sequences in the MSAs, HMM-HMM comparison is more sensitive and accurate than profile-sequence comparison: Compared to PSI-BLAST [
<xref ref-type="bibr" rid="B9">9</xref>
], HHblits is faster, up to twice as sensitive, and produces alignments with several percent higher accuracy [
<xref ref-type="bibr" rid="B8">8</xref>
]. For HHblits it is critical that only a very low number of clusters are corrupted by non-homologous sequences, since these can cause high-scoring false-positive matches. The clustering sensitivity is also important because MSAs with higher sequence diversity and higher information content are better at finding remotely related sequences. Also, the lower number of clusters results in shorter search times.</p>
<p>Sequence clustering methods first need to compare the sequences in the input set with each other. Most methods use FASTA or BLAST [
<xref ref-type="bibr" rid="B10">10</xref>
,
<xref ref-type="bibr" rid="B11">11</xref>
] for this purpose. Some of these methods use the pairwise sequence similarities for hierarchical clustering [
<xref ref-type="bibr" rid="B12">12</xref>
-
<xref ref-type="bibr" rid="B15">15</xref>
]. Others use them to cluster sequences into orthologous or functionally similar groups [
<xref ref-type="bibr" rid="B16">16</xref>
-
<xref ref-type="bibr" rid="B22">22</xref>
]. FASTA and BLAST are two to three orders of magnitude faster than Smith-Waterman search [
<xref ref-type="bibr" rid="B23">23</xref>
] due to their fast,
<italic>k</italic>
-mer-based heuristic filters, yet they would require around 80 years of CPU time to cluster the UniProt database version with ∼35.5 million sequence from scratch. SIMAP [
<xref ref-type="bibr" rid="B24">24</xref>
] employs a 9-TeraFLOP distributed network of computers to regularly update their database of precomputed FASTA similarity scores for a large set of protein sequences. This database can be tapped to avoid the time-consuming sequence alignments [
<xref ref-type="bibr" rid="B22">22</xref>
], but it cannot be used for sequences not yet contained in SIMAP or when the clustering should depend on information other than sequence similarity scores.</p>
<p>CD-HIT [
<xref ref-type="bibr" rid="B25">25</xref>
-
<xref ref-type="bibr" rid="B27">27</xref>
] and UCLUST [
<xref ref-type="bibr" rid="B28">28</xref>
] are sequence clustering methods that, like kClust, do not rely on an external sequence search tool. All three methods aim at clustering large sequence databases much faster than what is possible using BLAST or FASTA. They start with zero clusters and pick one sequence after the other from the database. If the query is sufficiently similar to the representative sequence of a cluster, the query is added to that cluster. Otherwise it becomes the founding member and representative sequence for a new cluster. For the fast comparison of the query to the representative sequences of the clusters, these methods first prefilter the representative sequences using an alignment-free,
<italic>k</italic>
-mer-based sequence comparison. Sequences that pass the prefilter are aligned to the query using Smith-Waterman alignment. The clustering finishes when all sequences have been picked and assigned to a cluster.</p>
<p>For prefiltering, CD-HIT and UCLUST simply count the number of identical
<italic>k</italic>
-mers between sequences. Because this number drops quickly as the similarity of the compared sequences decreases, CD-HIT uses shorter
<italic>k</italic>
-mers for achieving higher sensitivity: (e. g. 5-mers for clustering thresholds down to 70% sequence identity, and 2-mers below 50%). The choice of
<italic>k</italic>
follows from the requirement for near-perfect sensitivity, between 95% and 99%. Reducing
<italic>k</italic>
comes at a considerable loss of speed, however, since decrementing
<italic>k</italic>
by one results in approximately 20 times more chance
<italic>k</italic>
-mers matches and a 20-fold longer run time. CD-HIT can therefore cluster large databases such as UniProt only to down to ∼ 50% sequence identity. UCLUST uses 5-mers at all clustering thresholds. This allows it to maintain a high speed even at low thresholds, at the cost of a loss of sensitivity. It ranks the representative sequences by the number of 5-mers they have in common with the query sequence and aligns them in this order until one of them is similar to the query sequence or until the highest-ranked eight clusters have been rejected. An apparent disadvantage of UCLUST is that, in order to increase sensitivity despite its word length of 5, it uses rather loose acceptance criteria. At clustering thresholds below 50%, this leads to a high fraction of corrupted clusters containing non-homologous sequences.</p>
<p>Both CD-HIT and UCLUST use banded dynamic programming to speed up the Smith-Waterman search. In addition, UCLUST extends the alignment around identical 5-mer matches in a way similar to BLAST [
<xref ref-type="bibr" rid="B11">11</xref>
].</p>
<p>kClust achieves high sensitivity by allowing matches between similar k-mers and ranking sequence pairs by the sum of similarity scores over all similar k-mer pairs.</p>
<p>We benchmark the performance of kClust and other tools to cluster large sequence databases well below 50% sequence identity. Our results show that kClust achieves false discovery rates similar to BLAST at much higher speeds, whereas at these low clustering thresholds CD-HIT and UCLUST manifest severe limitations in sensitivity, false discovery rate, and speed.</p>
</sec>
<sec>
<title>Implementation</title>
<sec>
<title>Overview of kClust algorithm</title>
<p>kClust, like CD-HIT and UCLUST [
<xref ref-type="bibr" rid="B26">26</xref>
,
<xref ref-type="bibr" rid="B28">28</xref>
], uses the incremental, greedy clustering strategy of [
<xref ref-type="bibr" rid="B29">29</xref>
] (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Figure S1). First, all sequences are sorted by length. Starting with the longest one, the next sequence is picked from the database as query and is compared with the representative sequences representing the already created clusters. If the query fulfills kClust’s similarity criteria with one of the representative sequences, the query is added to that cluster, otherwise a new cluster is created for which the query becomes the representative sequence.</p>
<p>A fast, alignment-free prefilter reduces the number of costly computations of sequence alignments. Whereas CD-HIT and UCLUST count exact
<italic>k</italic>
-mer matches, kClust calculates the sum of similarity scores over all
<italic>similar</italic>
6-mers. To increase speed, alignments are constructed with a novel algorithm,
<italic>k</italic>
-mer dynamic programming (kDP), which finds the local optimal alignment passing through pairs of similar 4-mers (C.E.M. and J.S., to be published). Furthermore, kClust employs spaced 6-mers and 4-mers that reduce the noise caused by the correlation between scores of neighboring
<italic>k</italic>
-mer matches [
<xref ref-type="bibr" rid="B30">30</xref>
].</p>
<p>Three similarity criteria are used to decide if a query is added to a cluster: (1) The sequence similarity score from the 4-mer-based dynamic programming algorithm is larger than a minimum BLOSUM62 score per column (default 1.12 half bits, which corresponds to a sequence identity of ∼30
<italic>%</italic>
, see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Figure S2), (2) the alignment achieves an E-value less than a defined threshold (default value 1E-3), and (3) the alignment covers at least 80% of the residues of the representative sequence. This criterion ensures that clusters contain sequences with nearly identical domain composition.</p>
<p>In a second iteration, profile-based kClust can also merge homologous clusters. It computes sequence profiles of the clusters generated in the previous step and uses these for scoring during the prefilter and alignment stages. This further improves the sensitivity without loss of speed.</p>
</sec>
<sec>
<title>Prefiltering</title>
<p>kClust’s prefilter sums up the substitution matrix scores of all 6-mers above a certain score threshold
<italic>S</italic>
<sub>min</sub>
. (Default is 4.3 half-bits per column, or 12.9 bits, for a clustering threshold corresponding to roughly 30%, resulting in similar
<italic>k</italic>
-mer lists of an average length 100 per sequence position). As explained in detail in the supplementary discussion (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
), scoring
<italic>similar</italic>
<italic>k</italic>
-mers confers a great advantage over counting identical
<italic>k</italic>
-mers. Briefly, by allowing non-identical matches a sufficient number of matches will occur even at low sequence similarities and for a word length as large as 6. On the other hand, the number of chance
<italic>k</italic>
-mer matches expected for unrelated sequences strongly decreases with increasing
<italic>k</italic>
, which in turn improves both the discrimination between true and false positives and the speed of the prefilter. This is illustrated in Figure 
<xref ref-type="fig" rid="F1">1</xref>
, which shows the distribution of identical 3-mer matches and similar 6-mer matches for two proteins with a sequence identity of 45%.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>
<italic>k</italic>
</bold>
<bold>-mer matches comparison.</bold>
Comparison of exact 3-mer matches
<bold>(A)</bold>
versus similar 6-mer matches with
<italic>r</italic>
 = 100 
<bold>(B)</bold>
between two proteins with 45% sequence identity.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-1"></graphic>
</fig>
<p>The algorithm to calculate the sum of
<italic>k</italic>
-mer similarity scores is described in Figure 
<xref ref-type="fig" rid="F2">2</xref>
. Sequences whose score is above a certain threshold are aligned to the query in the next step (kDP). The default prefilter score threshold is set to 0.55 half bits per query position, which results in a 98% sensitivity level (see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Figure S3). The array index in the index table at which the pointer for each
<italic>k</italic>
-mer (
<italic>x</italic>
<sub>1</sub>
,…,
<italic>x</italic>
<sub>
<italic>k</italic>
</sub>
) is stored is calculated as
<inline-formula>
<mml:math id="M1" name="1471-2105-14-248-i1" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>Σ</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
, where |Σ| = 21 is the size of the amino acid alphabet.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Prefiltering step in kClust.</bold>
Prefiltering algorithm: For each
<italic>k</italic>
-mer in the query (
<italic>k</italic>
 = 6), a list of similar
<italic>k</italic>
-mers and their BLOSUM62 similarity scores is generated (blue frame). For each such
<italic>k</italic>
-mer (red), a pointer to a list of representative sequences containing this
<italic>k</italic>
-mer is looked up in an array (index table). The score
<italic>S</italic>
of each sequence in that list is increased by the similarity score. After all
<italic>k</italic>
-mers in the query have been processed, array
<italic>S</italic>
contains for each representative sequence the sum of
<italic>k</italic>
-mer similarity scores.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-2"></graphic>
</fig>
<sec>
<title>Generating lists of similar
<italic>k</italic>
-mers</title>
<p>A branch-and-bound algorithm generates the list of all
<italic>k</italic>
-mers that have a BLOSUM62 score above a specified threshold
<italic>S</italic>
<sub>
<italic>min</italic>
</sub>
. First, we calculate for each position
<italic>j</italic>
in the
<italic>k</italic>
-mer the minimum score
<italic>S</italic>
<sub>min,
<italic>j</italic>
</sub>
that is necessary to reach a total score above the threshold
<italic>S</italic>
<sub>min</sub>
. This score is simply
<italic>S</italic>
<sub>min</sub>
minus the maximum score achievable for
<italic>k</italic>
-mer positions
<italic>j</italic>
 + 1,…,
<italic>k</italic>
 − 1: </p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M2" name="1471-2105-14-248-i2" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>min</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>min</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>BLOSUM</mml:mtext>
<mml:mn>62</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mspace width="2.22144pt"></mml:mspace>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>S</italic>
<sub>BLOSUM62</sub>
(
<italic>x</italic>
,
<italic>y</italic>
) denote the BLOSUM62 substitution matrix scores. The list of similar
<italic>k</italic>
-mers is then generated iteratively position by position, by appending each of the 20 amino acids to the current
<italic>j</italic>
-mer and ending the branch if the score of the current
<italic>j</italic>
-mer is lower than the minimum required,
<italic>S</italic>
<sub>min,
<italic>j</italic>
</sub>
. Using lists of amino acid dimers speeds up the process further by generating two amino acids instead of one at each step. Each dimer list is pre-sorted according to the score to the query dimer, so the branch can be skipped after the score of the
<italic>j</italic>
-mer falls below
<italic>S</italic>
<sub>min,
<italic>j</italic>
</sub>
. Precomputed intermediate thresholds at each position and presorted dimer lists ensure that only
<italic>k</italic>
-mers with the overall score above the similarity threshold are generated. Since no
<italic>k</italic>
-mers except those with a score above the threshold are generated, this step has a time complexity of
<italic>O</italic>
(
<italic>r</italic>
).</p>
</sec>
<sec>
<title>Time complexity</title>
<p>The run time of the prefilter is the sum of two terms, the time to generate the lists of similar
<italic>k</italic>
-mers plus a term that is proportional to the number of
<italic>k</italic>
-mer matches between the compared sequences. The algorithm for generating the lists of similar
<italic>k</italic>
-mers has time complexity
<italic>O</italic>
(
<italic>r</italic>
), where
<italic>r</italic>
is the average length of the lists of similar
<italic>k</italic>
-mers. The second term stems from registering the
<italic>k</italic>
-mer matches with the representative sequences for each
<italic>k</italic>
-mer in the list. There are on average
<italic>N</italic>
<sub>clu</sub>
<italic>L</italic>
/20
<sup>6</sup>
representative sequences containing a match to one query
<italic>k</italic>
-mer, where
<italic>N</italic>
<sub>clu</sub>
is the number of clusters (i.e. of representative sequences) produced in the clustering and
<italic>L</italic>
is their average length. The run time is therefore dominated by the second term, which has a time complexity </p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M3" name="1471-2105-14-248-i3" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>db</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>clu</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>match</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mspace width="2.22144pt"></mml:mspace>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>N</italic>
<sub>db</sub>
is the number of sequences in the database and
<italic>p</italic>
<sub>match</sub>
is the probability of a chance match above the similarity threshold between two
<italic>k</italic>
-mers. Note that before a new query sequence can be processed, the score array
<italic>S</italic>
[·] (see Figure 
<xref ref-type="fig" rid="F2">2</xref>
) holding the prefiltering scores for the database sequences for this query has to be reset to 0. The naive resetting would result in a time complexity of
<inline-formula>
<mml:math id="M4" name="1471-2105-14-248-i4" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>db</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
which would be too time-consuming. For each query, we therefore start with an empty list to which we add all sequence indices whose score has been increased above zero, and we reset only the scores in that list.</p>
</sec>
</sec>
<sec>
<title>
<italic>k</italic>
-mer dynamic programming (kDP)</title>
<p>For the sequences that pass the prefiltering step, pairwise alignments are calculated using a fast heuristic,
<italic>k</italic>
-mer dynamic programming [
<xref ref-type="bibr" rid="B31">31</xref>
]. kDP records the similar 4-mer matches in the two sequences and determines the optimal alignment passing through the matched 4-mers. The optimal alignment is the one maximizing the sum of
<italic>k</italic>
-mer similarity scores minus gap penalties. In a second step, the full, residue-wise alignment with optimal BLOSUM62 score that passes through the 4-mers on the optimal kDP alignment is determined. Since kDP operates only on the similar 4-mer matches, the run time is proportional to number of matches,
<italic>p</italic>
<sub>match</sub>
<italic>L</italic>
<sup>2</sup>
, where
<italic>L</italic>
is the length of the sequences. Hence, the run time can be reduced in principle by a factor of
<italic>p</italic>
<sub>match</sub>
. In kClust, we chose the similarity threshold such that the lists of similar 4-mers have an average length of
<italic>r</italic>
 = 200. Therefore,
<italic>p</italic>
<sub>match</sub>
 ≈ 
<italic>r</italic>
 / 20
<sup>4</sup>
 ≈ 0.002 (see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
).</p>
</sec>
<sec>
<title>Memory swapping</title>
<p>The index table needs 21
<sup>6</sup>
 × 8
<italic>B</italic>
 ≈ 650
<italic>M</italic>
<italic>B</italic>
of main memory on a 64 bit system. The lists of indices of sequences containing the same
<italic>k</italic>
-mers take up a space of
<italic>N</italic>
<sub>clu</sub>
<italic>L</italic>
 × 4
<italic>B</italic>
, or, assuming
<italic>N</italic>
<sub>clu</sub>
 = 3
<italic>E</italic>
6 and
<italic>L</italic>
 = 350, approximately 4
<italic>G</italic>
<italic>B</italic>
. To allow kClust to run on computers with less memory, we have implemented a memory swapping procedure (s. Figure 
<xref ref-type="fig" rid="F3">3</xref>
). The input sequences are divided into blocks (green). The first block is clustered with the usual clustering procedure, and a list of representatives is written to the hard disk (blue). Each following block is first compared to the representatives of the previous blocks (squares). In the last step, the remaining sequences are clustered producing a new list of representatives (triangles). With this procedure, kClust has only a part of the database in the main memory at each point of time.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Swapping.</bold>
Memory swapping procedure of kClust. Explanation see text.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-3"></graphic>
</fig>
</sec>
<sec>
<title>Iterative, profile-based clustering</title>
<p>Iterative kClust is the extension of the basic kClust algorithm that allows to further merge clusters from the first clustering run (Figure 
<xref ref-type="fig" rid="F4">4</xref>
). Iterative kClust compares clusters from the previous kClust run to each other instead of single sequences and merges clusters that are similar enough.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Iterative kClust.</bold>
Overview over the iterative kClust method. First, kClust clusters the initial sequence database. Then, multiple sequence alignments are generated and profile and consensus sequences are computed for each cluster. Finally, profile-based kClust merges the clusters.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-4"></graphic>
</fig>
<p>Multiple sequence alignments (MSAs) are generated for each cluster stemming from the previous iteration of kClust, and for each MSA a profile HMM and a consensus sequence are calculated with the
<monospace>hhmake</monospace>
and
<monospace>hhconsensus</monospace>
binaries of the open-source HH-suite software package [
<xref ref-type="bibr" rid="B8">8</xref>
]. The kClust algorithm proceeds in a similar way to the standard case, but instead of picking query sequences from the database, sequence profiles representing clusters are picked from the previously clustered database and compared to the clusters that have already been created in the present clustering iteration. As representative sequences of the clusters, the consensus sequences are used, which improves the sensitivity further [
<xref ref-type="bibr" rid="B32">32</xref>
]. The BLOSUM62 scores are simply replaced by the position-specific profile scores read from the profile HMM files. In the prefiltering, for instance, the lists and scores of the similar
<italic>k</italic>
-mers depend on the local 6-column window of the query profile. If no representative sequences similar to the query profile are found, the query is added as a new cluster to the database.</p>
</sec>
<sec>
<title>Spaced seeds</title>
<p>We use spaced instead of consecutive
<italic>k</italic>
-mers in the prefiltering and kDP steps. The generation of similar spaced
<italic>k</italic>
-mers for a sequence is illustrated in Figure 
<xref ref-type="fig" rid="F5">5</xref>
. Spaced
<italic>k</italic>
-mers reduce the correlation of neighboring
<italic>k</italic>
-mer scores [
<xref ref-type="bibr" rid="B30">30</xref>
]. Therefore,
<italic>k</italic>
-mer matches are more evenly distributed and the probability of high scoring clusters of chance matches is reduced.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Generation of spaced</bold>
<bold>
<italic>k</italic>
</bold>
<bold>-mers.</bold>
Spaced
<italic>k</italic>
-mers: The figure illustrates the generation of a list of spaced
<italic>k</italic>
-mers in the fast kClust prefilter algorithm (cf. Figure
<xref ref-type="fig" rid="F2">2</xref>
).</p>
</caption>
<graphic xlink:href="1471-2105-14-248-5"></graphic>
</fig>
</sec>
<sec>
<title>Alignment generation and annotation of the clusters</title>
<p>The binary
<monospace>kClust_mkAln</monospace>
generates MSAs for each sequence cluster, using a user-specified multiple sequence alignment program. In addition, a cluster header with merged and redundancy-filtered names and annotations from the headers of the contained sequences is generated. The merging of UniProt and NCBI headers is supported.</p>
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<sec>
<title>Benchmarked methods and parameters used</title>
<p>We compared kClust to three methods that are able to cluster large protein databases: CD-HIT, UCLUST, and BLAST-based clustering.</p>
<sec>
<title>CD-HIT</title>
<p>We clustered the datasets with the newest, parallelized version of CD-HIT [
<xref ref-type="bibr" rid="B27">27</xref>
] on 16 cores down to a sequence identity of 40%, the lowest possible value. We used the
<monospace>-n 2</monospace>
setting (
<italic>k</italic>
-mer word length) recommended for a clustering threshold of 40%. We set the minimum alignment coverage of the longer sequence to 80% with the
<monospace>-aL 0.8</monospace>
option and the maximum available memory to 6000 MB with the
<monospace>-M 6000</monospace>
option. We used parallel calculation on 16 cores of the computer by setting
<monospace>-T 0</monospace>
(i. e. all available cores).</p>
</sec>
<sec>
<title>UCLUST</title>
<p>We ran UCLUST with the
<monospace>-id 0.3</monospace>
and
<monospace>0.4</monospace>
options (“UCLUST 30” and “UCLUST 40”). Additionally, we used the
<monospace>-targetfract 0.8</monospace>
setting for the minimum alignment coverage 0.8 of the longer sequence. We tried the
<monospace>-usersort</monospace>
option that should sort the input database by length, but this led to occasional segmentation faults. We therefore gave UCLUST a length-sorted database as input. As a side note, when not crashing, UCLUST produced worse results with the
<monospace>-usersort</monospace>
option than with a presorted database.</p>
</sec>
<sec>
<title>BLAST-based clustering</title>
<p>BLAST-based clustering was calculated using
<monospace>pdbfilter.pl</monospace>
from HH-suite (
<ext-link ext-link-type="uri" xlink:href="http://toolkit.lmb.uni-muenchen.de/HH-suite/">http://toolkit.lmb.uni-muenchen.de/HH-suite/</ext-link>
), adapted to our input data. This simple procedure uses the same incremental greedy clustering strategy as CD-HIT, UCLUST, and kClust, but instead of comparing sequences using a prefilter and dynamic programming, parallel BLAST (option
<monospace>-a 8</monospace>
) was employed. The acceptance condition to add a query sequence to an existing cluster was an
<italic>E</italic>
-value below 1E-3 and an alignment that covers more than 80% of the representative sequence. No condition on the sequence identity is used, which corresponds to the “max sens” version of kClust.</p>
</sec>
<sec>
<title>kClust</title>
<p>Four parameter settings were tested that differed in the acceptance criteria for adding sequences to existing clusters. In all four cases, the maximum
<italic>E</italic>
-value was 1E-3 and the minimum alignment coverage was 80%, which are the default values. For “kClust 30” and “kClust 20”, kClust was run with a clustering threshold of 30% and 20%, respectively. “kClust sens” was run with maximum sensitivity, i.e., with a clustering threshold at 0%. For “kClust iter”, one profile-based kClust iteration was run after a single iteration with “kClust max sens” settings.</p>
</sec>
</sec>
<sec>
<title>Clustering Pfam</title>
<p>Pfam-A-seeds version 24.0 are the 500 361 sequences that define the manually curated 11 912 domain families of the Pfam data base [
<xref ref-type="bibr" rid="B33">33</xref>
]. Of these, 4011 Pfam families are further grouped into 458 clans – a collection of families that are thought to be evolutionarily related. There are therefore 8359 groups of homologous sequences in the Pfam-A seeds set.</p>
<p>A pair of sequences is considered to be incorrectly grouped together in one cluster if the sequences belong to different Pfam families and to different clans. We call such a cluster corrupted.</p>
<p>All benchmarked methods produce an order of magnitude more clusters than the possible 8359 groups (s. Table 
<xref ref-type="table" rid="T1">1</xref>
). The reason is that many homologous relationships are hard to detect because of the low sequence similarities within many Pfam seed alignments. All methods generate only few corrupted clusters, with the exception of CD-HIT. This is not surprising, since this dataset contains only single-domain proteins, yet the most false positive cluster assignments originate from mistaking local similarities as global.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Pfam clustering results</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left" valign="bottom"> 
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Corrupted clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>%Corrupted clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Wrong seqs per</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>Time</bold>
<hr></hr>
</th>
</tr>
<tr>
<th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="center">
<bold>corrupted cluster</bold>
</th>
<th align="left"> </th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">BLAST
<hr></hr>
</td>
<td align="center" valign="bottom">118 920
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">4.2e-3
<hr></hr>
</td>
<td align="center" valign="bottom">2.4
<hr></hr>
</td>
<td align="center" valign="bottom">66 h 2 m
<sup></sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust iter
<hr></hr>
</td>
<td align="center" valign="bottom">111 251
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">5.2e-3
<hr></hr>
</td>
<td align="center" valign="bottom">4.0
<hr></hr>
</td>
<td align="center" valign="bottom">28 m/1 h 47 m
<sup></sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust sens
<hr></hr>
</td>
<td align="center" valign="bottom">153 721
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">5.2e-3
<hr></hr>
</td>
<td align="center" valign="bottom">1.6
<hr></hr>
</td>
<td align="center" valign="bottom">17 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust 20
<hr></hr>
</td>
<td align="center" valign="bottom">153 883
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">5.1e-3
<hr></hr>
</td>
<td align="center" valign="bottom">1.6
<hr></hr>
</td>
<td align="center" valign="bottom">16 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust 30
<hr></hr>
</td>
<td align="center" valign="bottom">169 533
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">2.9e-3
<hr></hr>
</td>
<td align="center" valign="bottom">1.6
<hr></hr>
</td>
<td align="center" valign="bottom">16 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">UCLUST 30
<hr></hr>
</td>
<td align="center" valign="bottom">234 039
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">4.3e-3
<hr></hr>
</td>
<td align="center" valign="bottom">1.0
<hr></hr>
</td>
<td align="center" valign="bottom">2 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">UCLUST 40
<hr></hr>
</td>
<td align="center" valign="bottom">244 568
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">3.2e-3
<hr></hr>
</td>
<td align="center" valign="bottom">1.0
<hr></hr>
</td>
<td align="center" valign="bottom">2 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left">CD-HIT</td>
<td align="center">170 750</td>
<td align="center">4 086</td>
<td align="center">2.39</td>
<td align="center">1.39</td>
<td align="center">5 h 26 m</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Clustering results on PfamA-seed single-domain sequences.
<sup></sup>
The time is the sum of the run times of all 8 parallel threads.
<sup></sup>
Includes the time for the calculation of multiple sequence alignments.</p>
</table-wrap-foot>
</table-wrap>
<p>Two iterations kClust achieve the highest sensitivity, i.e., the lowest number of clusters. The second iteration of kClust reduces the number of clusters in the first iteration almost by 30% without increasing the number of corrupted clusters. BLAST produces slightly more clusters with a comparable error rate. However, there is a striking difference in the run times – two iterations kClust take 28 min while BLAST clustering takes 66 hours. UCLUST is the fastest method, but it clusters less effectively 1.5 times more clusters than kClust and twice more than BLAST. CD-HIT produces less clusters than UCLUST, but with the highest error rate and is slow.</p>
</sec>
<sec>
<title>Clustering artificial two-domain sequences</title>
<p>Most eukaryotic sequences consist of two or more structural domains. Multi-domain sequence pose greater challenges to clustering algorithms because they tend to cluster sequences together if they are sufficiently similar within one domain, even though their other domains may be unrelated.</p>
<p>Currently, most protein sequences are only partly or not annotated with contained domains. To test the ability of the tools to correctly cluster multi-domain sequences, we need a dataset of multi-domain sequences where all the sequences are annotated over their entire length. It should contain a sufficient number of sequences with the same domain composition in order to form enough clusters and also enough sequences with only local similarities, so it would be not trivial. Currently, there is no real dataset that would meet these criteria.</p>
<p>To test the ability of the tools to correctly cluster multi-domain sequences, we generated an artificial dataset of 100 000 two-domain proteins. We used the SCOP25 database [
<xref ref-type="bibr" rid="B34">34</xref>
] (SCOP database filtered to maximum 25% pairwise similarity), downloaded from the ASTRAL server (
<ext-link ext-link-type="uri" xlink:href="http://astral.berkeley.edu/">http://astral.berkeley.edu/</ext-link>
). For each seed sequence from SCOP, we searched for homologous sequence segments using two iterations of PSI-BLAST [
<xref ref-type="bibr" rid="B9">9</xref>
] in the non-redundant database from NCBI (
<italic>nr</italic>
) and filtered the results to 30% minimum sequence identity to the seed sequence from SCOP and to 70% maximum pairwise sequence identity, using the
<monospace>hhfilter</monospace>
binary from the HH-suite package [
<xref ref-type="bibr" rid="B8">8</xref>
], in order to achieve not too low and not too high sequence similarity within the homologous group. Additionally, the homologous sequences must cover at least 90% of the seed sequence.</p>
<p>Sequences that remained after filtering were merged into groups, in accordance to the SCOP family membership of the seed sequences, one group corresponding to one SCOP family. 1000 groups containing more than 100 sequences were drawn randomly, and from the list of these groups 1000 combinations of two different SCOP families were drawn (two-domain architectures). For each such two-domain architecture, 100 two-domain sequences were generated by concatenating two randomly drawn sequences from the corresponding family groups. Additionally, we rejected a sequence if the longer domain in the artificial sequence covered more than 80% of the overall sequence length.</p>
<p>Two sequences were considered to be grouped correctly in one cluster if they have the same domain architecture. Thus, the dataset could, in principle, be clustered into 1000 clusters of homologous sequences with the same domain architecture.</p>
<p>The results of the clustering are shown in Table 
<xref ref-type="table" rid="T2">2</xref>
. Almost a quarter of the clusters CD-HIT and UCLUST generate are corrupted, although usually only by few sequences. BLAST produces 2.6% corrupted clusters. The reason for these false positives is the tendency of BLAST to extend into the unrelated region the alignment of sequences having only one domain in common. A slightly positive score over a nonhomologous segment is sufficient to extend the alignment to such length that it satisfies the alignment coverage criterion of 80%. kClust produces very reliable clusters, with 0.1% or fewer corrupted. This is likely due to the conservative alignments produced by kDP, which rarely extend beyond the homologous regions.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Multidomain proteins clustering results</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left" valign="bottom"> 
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Corrupted clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>%Corrupted clusters</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>#Wrong seq per</bold>
<hr></hr>
</th>
<th align="center" valign="bottom">
<bold>Time</bold>
<hr></hr>
</th>
</tr>
<tr>
<th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="left"> </th>
<th align="center">
<bold>corrupted cluster</bold>
</th>
<th align="left"> </th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">BLAST
<hr></hr>
</td>
<td align="center" valign="bottom">6 977
<hr></hr>
</td>
<td align="center" valign="bottom">186
<hr></hr>
</td>
<td align="center" valign="bottom">2.66
<hr></hr>
</td>
<td align="center" valign="bottom">1.5
<hr></hr>
</td>
<td align="center" valign="bottom">3 h 21 m
<sup></sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust iter
<hr></hr>
</td>
<td align="center" valign="bottom">8 537
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">0.1
<hr></hr>
</td>
<td align="center" valign="bottom">1.1
<hr></hr>
</td>
<td align="center" valign="bottom">13 m/39 m
<sup></sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust sens
<hr></hr>
</td>
<td align="center" valign="bottom">17 070
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">0.06
<hr></hr>
</td>
<td align="center" valign="bottom">1.1
<hr></hr>
</td>
<td align="center" valign="bottom">9 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust 20
<hr></hr>
</td>
<td align="center" valign="bottom">17 127
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">0.06
<hr></hr>
</td>
<td align="center" valign="bottom">1.1
<hr></hr>
</td>
<td align="center" valign="bottom">9 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust 30
<hr></hr>
</td>
<td align="center" valign="bottom">22 047
<hr></hr>
</td>
<td align="center" valign="bottom">6
<hr></hr>
</td>
<td align="center" valign="bottom">0.03
<hr></hr>
</td>
<td align="center" valign="bottom">1.2
<hr></hr>
</td>
<td align="center" valign="bottom">9 m
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">UCLUST 30
<hr></hr>
</td>
<td align="center" valign="bottom">39 284
<hr></hr>
</td>
<td align="center" valign="bottom">10132
<hr></hr>
</td>
<td align="center" valign="bottom">25.79
<hr></hr>
</td>
<td align="center" valign="bottom">1.7
<hr></hr>
</td>
<td align="center" valign="bottom">30s
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">UCLUST 40
<hr></hr>
</td>
<td align="center" valign="bottom">50 104
<hr></hr>
</td>
<td align="center" valign="bottom">10512
<hr></hr>
</td>
<td align="center" valign="bottom">20.98
<hr></hr>
</td>
<td align="center" valign="bottom">1.4
<hr></hr>
</td>
<td align="center" valign="bottom">40s
<hr></hr>
</td>
</tr>
<tr>
<td align="left">CD-HIT</td>
<td align="center">29 163</td>
<td align="center">6 234</td>
<td align="center">21.37</td>
<td align="center">1.89</td>
<td align="center">43m</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Clustering results on a set of 100 000 two-domain sequences constructed from 1000 domain architectures with 100 sequences each.
<sup></sup>
The time is the sum of the run times of all 8 parallel threads.
<sup></sup>
Includes the time for the calculation of multiple sequence alignments.</p>
</table-wrap-foot>
</table-wrap>
<p>BLAST is clearly the most sensitive method, generating only 6977 clusters. kClust produces 17 070 in the first iteration with the maximum sensitivity setting, and one additional iteration reduces this number to 8537 without increasing the number of false positives. kClust clustered the dataset in the first iteration in about 9 min, the second iteration takes additional 4 minutes (without the time needed to generate multiple alignments of the clusters and profiles that are necessary for the second iteration). BLAST-based clustering took about 3.5 hours. UCLUST is very fast, needing less than one minute, but is much less sensitive, producing clusters with only 2.5 sequences on average, compared to kClust 20 with 5.8 on average.</p>
<p>It is noteworthy that even BLAST produces significantly more than the 1000 clusters that are theore- tically possible. The reason is that many alignments between sequences within one architecture group do not fulfill the coverage criterion, because the longest sequence is often significantly longer than many of the other members of the group, and because BLAST, like all other local alignment methods, often generates alignments that do not cover the entire homologous region.</p>
</sec>
<sec>
<title>Clustering the UniProt database</title>
<p>At the time of the benchmark calculation (June 2013), UniProt [
<xref ref-type="bibr" rid="B35">35</xref>
] contained 36 042 779 amino acid sequences with 11 786 916 970 residues and average sequence length of about 330 residues.</p>
<p>The results for the clustering of the UniProt database are shown in Table 
<xref ref-type="table" rid="T3">3</xref>
. Since clustering run time increases quadratically with the database size, only kClust and UCLUST are able to cluster UniProt down to 30% sequence identity within a few days. All-against-all single-threaded BLAST would need about 80 years for the clustering. kClust clusters UniProt in 13 days 12 hours and produces 5 663 658 clusters. UCLUST needs only 3 days 5 hours to cluster UniProt and produces 6 636 076 clusters. Both tools were run with 30% sequence identity threshold and otherwise default settings. kClust needs about 12 GB of main memory for the clustering, the memory consumption of UCLUST is about 26 GB.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Clustering of the UniProt database</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="center">
<bold>#Clusters</bold>
</th>
<th align="center">
<bold>Time</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">BLAST (estimated)
<hr></hr>
</td>
<td align="center" valign="bottom">?
<hr></hr>
</td>
<td align="center" valign="bottom">80 y
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">kClust 30
<hr></hr>
</td>
<td align="center" valign="bottom">5 663 658
<hr></hr>
</td>
<td align="center" valign="bottom">13 d 12 h
<hr></hr>
</td>
</tr>
<tr>
<td align="left">UCLUST 30</td>
<td align="center">6 636 076</td>
<td align="center">3 d 5 h</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Results for the clustering of the UniProt database, containing 36 042 779 sequences. BLAST running time is estimated.</p>
</table-wrap-foot>
</table-wrap>
<p>We compared the quality of the clustering by checking the performance of HHblits on UniProt clustered with kClust or UCLUST, respectively (Figure 
<xref ref-type="fig" rid="F6">6</xref>
). The performance is plotted after 1, 2 and 3 HHblits iterations. HHblits performs significantly better on UniProt clustered with kClust. 2 iterations of HHblits reach the same sensitivity on as 3 iterations on UniProtKB clustered with UCLUST.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Performance of HHblits on the clustered UniProtKB.</bold>
Fraction of queries with ROC5 value above the threshold on the x-axis, for one, two, and three HHblits iterations on the test set (5287 sequences from the SCOP 1.73 database). All but the last search iteration are performed against the UniProt. The last search iteration is done through a combined database containing the UniProt and the SCOP sequences. TPs are defined as pairs from the same SCOP folds, FPs as pairs from different folds, with the exception of Rossman folds and
<italic>β</italic>
propellers. The ROC5 value is the area under the ROC curve up to the 5th FP, normalized to yield a theoretical maximum of 1. The ROC5 plot is more robust to overfitting than the ROC curves.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-6"></graphic>
</fig>
</sec>
<sec>
<title>kClust running times on SwissProt</title>
<p>Running time of kClust depends heavily on the desired sequence identity threshold for the clustering. For lower thresholds, in addition to lowering the sequence identity threshold, kClust generates longer similar
<italic>k</italic>
-mer lists during the prefiltering and the alignment in order to increase sensitivity. As a consequence, the generation of the lists and index table matching takes longer.</p>
<p>We performed a benchmark on SwissProt, a protein sequence database containing 540 261 sequences at the time of the benchmark calculation, and plotted kClust running times against the sequence identity threshold. The results are shown in Figure 
<xref ref-type="fig" rid="F7">7</xref>
.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>kClust running time vs. clustering threshold.</bold>
kClust running times dependency on sequence identity threshold, calculated on SwissProt.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-7"></graphic>
</fig>
</sec>
<sec>
<title>Memory consumption of kClust and UCLUST</title>
<p>We performed a comparison of memory consumption of kClust and UCLUST for different database sizes. The results are shown in Figure 
<xref ref-type="fig" rid="F8">8</xref>
. The different datasets are randomly chosen from the UniProt, the largest dataset is the whole UniProt.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>Memory consumption of kClust and UCLUST for different database sizes.</p>
</caption>
<graphic xlink:href="1471-2105-14-248-8"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>With the rapidly growing numbers of protein sequences from genome sequencing and metagenomics projects, methods that can cluster huge sequence sets in a reasonable time are in great need. Tools that are sensitive enough to cluster sequences together down to ∼30 % sequence identity will be particularly useful, since at that similarity protein domains usually still have the same or very similar molecular functions, in particular if their domain architecture is conserved [
<xref ref-type="bibr" rid="B21">21</xref>
,
<xref ref-type="bibr" rid="B36">36</xref>
]. Most existing sequence clustering methods rely on BLAST or FASTA for calculating the similarities between the sequences to be clustered, which makes them too slow for many clustering tasks ahead.</p>
<p>CD-HIT is fast and works well at high sequence identity clustering threshold, but it gets impracticably slow and inaccurate below 50% sequence identity. UCLUST is a very fast alternative which has, however, limited sensitivity as demonstrated in our study. kClust is to our knowledge the only method at present that fills the need for the fast clustering of large sequence databases or metagenomics sequence with a sensitivity not far from BLAST. kClust achieves this sensitivity with its prefilter that sums up the scores of similar 6-mers, its fast 4-mer-based alignment algorithm kDP, and its iterative, profile-based clustering strategy.</p>
<p>This is to our knowledge a novel approach to solving the problem of inexact, alignment-free comparison of sequences. A somewhat related approach that is popular in next-Generation sequencing data analysis is the “spaced seed” approach. In that approach, exemplified by PatternHunter [
<xref ref-type="bibr" rid="B30">30</xref>
] or SEED [
<xref ref-type="bibr" rid="B37">37</xref>
], one looks for exact matches between the compared sequences at nonconsecutive positions. Such a pattern of non-consecutive positions is called a spaced seed. Searches with several different such spaced seed patterns are then combined. For a given maximum number of mismatches, spaced seed sets can be constructed that guarantee an exact match for at least one spaced seed. This approach is very efficient as long as the number of allowed mismatches is low, say 4 out of 100. Our approach is applicable also in a range of 70 mismatches out of 100, far below what the spaced seed approach could handle due to the exploding number of spaced seeds required.</p>
<p>We are currently working on a faster and more flexible successor software which will be parallelized to run efficiently on multi-core architectures, which offers a more powerful clustering algorithm than the incremental, greedy clustering used in kClust, CD-HIT and UCLUST, and which will also allow to perform very fast parallelized sequence searches of a set of sequences against another set of sequences or sequence profiles. Importantly, it will allow updating of clustered databases with new sequences, thus obviating the need to recluster the entire sequence set from scratch.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>CEM and JS designed the fast k-mer prefiltering, kDP, and kClust. CEM implemented the fast k-mer prefiltering, kDP, and noniterative kClust. MH designed and implemented the iterative kClust and the benchmarks. MH and JS wrote the manuscript. JS coordinated the project. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>Supplementary material.</p>
</caption>
<media xlink:href="1471-2105-14-248-S1.pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>M.H. gratefully acknowledges financial support from the Deutsche Telekom-Stiftung. C.E.H. and J.S. thank Andrei Lupas, Max-Planck Institute for Developmental Biology, Tübingen, for financial support and for hosting the first phase of this project. The work was supported by a Gastprofessur grant from the Ludwig-Maximilians-Universität München to Patrick Cramer and J.S., which was financed by the Excellence Initiative of the Bundesministerium für Bildung und Forschung (BMBF).</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Chubb</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Jefferys</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Sternberg</surname>
<given-names>MJE</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>LA</given-names>
</name>
<article-title>Sequencing delivers diminishing returns for homology detection: implications for mapping the protein universe</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>21</issue>
<fpage>2664</fpage>
<lpage>2671</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/26/21/2664.abstract">http://bioinformatics.oxfordjournals.org/content/26/21/2664.abstract</ext-link>
]</comment>
<pub-id pub-id-type="pmid">20843957</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<article-title>Sequence clustering strategies improve remote homology recognitions while reducing search times</article-title>
<source>Protein Eng</source>
<year>2002</year>
<volume>15</volume>
<issue>8</issue>
<fpage>643</fpage>
<lpage>649</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://view.ncbi.nlm.nih.gov/pubmed/12364578">http://view.ncbi.nlm.nih.gov/pubmed/12364578</ext-link>
]</comment>
<pub-id pub-id-type="pmid">12364578</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Park</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Holm</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Chothia</surname>
<given-names>C</given-names>
</name>
<article-title>RSDB: representative protein sequence databases have high information content</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<issue>5</issue>
<fpage>458</fpage>
<lpage>464</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://view.ncbi.nlm.nih.gov/pubmed/10871268">http://view.ncbi.nlm.nih.gov/pubmed/10871268</ext-link>
]</comment>
<pub-id pub-id-type="pmid">10871268</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Suzek</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>McGarvey</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Mazumder</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>C</given-names>
</name>
<article-title>UniRef: comprehensive and non-redundant UniProt reference clusters</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>10</issue>
<fpage>1282</fpage>
<lpage>1288</lpage>
<pub-id pub-id-type="pmid">17379688</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Heidelberg</surname>
<given-names>KB</given-names>
</name>
<name>
<surname>Williamson</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yooseph</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Remington</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Beeson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Tran</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Baden-Tillson</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Thorpe</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Freeman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Andrews-Pfannkoch</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Venter</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kravitz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Utterback</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rogers</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Falcón</surname>
<given-names>LI</given-names>
</name>
<name>
<surname>Souza</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Bonilla-Rosso</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Eguiarte</surname>
<given-names>LE</given-names>
</name>
<name>
<surname>Karl</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Sathyendranath</surname>
<given-names>S</given-names>
</name>
<etal></etal>
<article-title>The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific</article-title>
<source>PLoS Biol</source>
<year>2007</year>
<volume>5</volume>
<issue>3</issue>
<fpage>e77</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pbio.0050077">http://dx.doi.org/10.1371/journal.pbio.0050077</ext-link>
]</comment>
<pub-id pub-id-type="pmid">17355176</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<collab>Human Microbiome Jumpstart Reference Strains Consortium</collab>
<article-title>A catalog of reference genomes from the human microbiome</article-title>
<source>Science</source>
<year>2010</year>
<volume>328</volume>
<issue>5981</issue>
<fpage>994</fpage>
<lpage>999</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1126/science.1183605">http://dx.doi.org/10.1126/science.1183605</ext-link>
]</comment>
<pub-id pub-id-type="pmid">20489017</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Arumugam</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Burgdorf</surname>
<given-names>KS</given-names>
</name>
<name>
<surname>Manichanh</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pons</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Levenez</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Yamada</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Mende</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tap</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lepage</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bertalan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Batto</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Hansen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Le Paslier</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Linneberg</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>HB</given-names>
</name>
<name>
<surname>Pelletier</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Renault</surname>
<given-names>P</given-names>
</name>
<name>
<surname>et</surname>
<given-names>al</given-names>
</name>
<article-title>A human gut microbial gene catalogue established by metagenomic sequencing</article-title>
<source>Nature</source>
<year>2010</year>
<volume>464</volume>
<issue>7285</issue>
<fpage>59</fpage>
<lpage>65</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nature08821">http://dx.doi.org/10.1038/nature08821</ext-link>
]</comment>
<pub-id pub-id-type="pmid">20203603</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Remmert</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Biegert</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hauser</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Söding</surname>
<given-names>J</given-names>
</name>
<article-title>HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment</article-title>
<source>Nat Methods</source>
<year>2011</year>
<volume>9</volume>
<issue>2</issue>
<fpage>173</fpage>
<lpage>175</lpage>
<pub-id pub-id-type="pmid">22198341</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<year>1997</year>
<volume>25</volume>
<issue>17</issue>
<fpage>3389</fpage>
<lpage>3402</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/25.17.3389">http://dx.doi.org/10.1093/nar/25.17.3389</ext-link>
]</comment>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Pearson</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D</given-names>
</name>
<article-title>Improved tools for biological sequence comparison</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>1988</year>
<volume>85</volume>
<issue>8</issue>
<fpage>2444</fpage>
<lpage>2448</lpage>
<pub-id pub-id-type="pmid">3162770</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>410</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1006/jmbi.1990.9999">http://dx.doi.org/10.1006/jmbi.1990.9999</ext-link>
]</comment>
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Yona</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>M</given-names>
</name>
<article-title>ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space</article-title>
<source>Proteins</source>
<year>1999</year>
<volume>37</volume>
<issue>3</issue>
<fpage>360</fpage>
<lpage>378</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.hubmed.org/display.cgi?uids=10591097">http://www.hubmed.org/display.cgi?uids=10591097</ext-link>
]</comment>
<pub-id pub-id-type="pmid">10591097</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Krause</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
<article-title>The SYSTERS protein sequence cluster set</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<fpage>270</fpage>
<lpage>272</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC102384/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC102384/</ext-link>
]</comment>
<pub-id pub-id-type="pmid">10592244</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Miele</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Penel</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Duret</surname>
<given-names>L</given-names>
</name>
<article-title>Ultra-fast sequence clustering from similarity networks with SiLiX</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>116</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/1471-2105-12-116">http://dx.doi.org/10.1186/1471-2105-12-116</ext-link>
]</comment>
<pub-id pub-id-type="pmid">21513511</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Rappoport</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Karsenty</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stern</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>M</given-names>
</name>
<article-title>ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>40</volume>
<issue>D1</issue>
<fpage>D313</fpage>
<lpage>D320</lpage>
<pub-id pub-id-type="pmid">22121228</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Remm</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Storm</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
<article-title>Automatic clustering of orthologs and in-paralogs from pairwise species comparisons</article-title>
<source>J Mol Biol</source>
<year>2001</year>
<volume>314</volume>
<issue>5</issue>
<fpage>1041</fpage>
<lpage>1052</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1006/jmbi.2000.5197">http://dx.doi.org/10.1006/jmbi.2000.5197</ext-link>
]</comment>
<pub-id pub-id-type="pmid">11743721</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
<article-title>An efficient algorithm for large-scale detection of protein families</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>7</issue>
<fpage>1575</fpage>
<lpage>1584</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/30.7.1575">http://dx.doi.org/10.1093/nar/30.7.1575</ext-link>
]</comment>
<pub-id pub-id-type="pmid">11917018</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Fedorova</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Jackson</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Jacobs</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Kiryutin</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
<name>
<surname>Krylov</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Mazumder</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mekhedov</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Nikolskaya</surname>
<given-names>AN</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>BS</given-names>
</name>
<name>
<surname>Smirnov</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sverdlov</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Vasudevan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wolf</surname>
<given-names>YI</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Natale</surname>
<given-names>DA</given-names>
</name>
<article-title>The COG database: an updated version includes eukaryotes</article-title>
<source>BMC Bioinformatics</source>
<year>2003</year>
<volume>4</volume>
<fpage>41</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/1471-2105-4-41">http://dx.doi.org/10.1186/1471-2105-4-41</ext-link>
]</comment>
<pub-id pub-id-type="pmid">12969510</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Stoeckert</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Roos</surname>
<given-names>DS</given-names>
</name>
<article-title>OrthoMCL: identification of ortholog groups for eukaryotic genomes</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<issue>9</issue>
<fpage>2178</fpage>
<lpage>2189</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1101/gr.1224503">http://dx.doi.org/10.1101/gr.1224503</ext-link>
]</comment>
<pub-id pub-id-type="pmid">12952885</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Alexeyenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tamas</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
<article-title>Automatic clustering of orthologs and inparalogs shared by multiple proteomes</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>14</issue>
<fpage>e9</fpage>
<lpage>e15</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btl213">http://dx.doi.org/10.1093/bioinformatics/btl213</ext-link>
]</comment>
<pub-id pub-id-type="pmid">16873526</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>WV</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>WC</given-names>
</name>
<article-title>DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 7</issue>
<fpage>S6</fpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1186/1471-2105-11-S7-S6">http://dx.doi.org/10.1186/1471-2105-11-S7-S6</ext-link>
]</comment>
<pub-id pub-id-type="pmid">21106128</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Powell</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Szklarczyk</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Trachana</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Roth</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kuhn</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Muller</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Arnold</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rattei</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Letunic</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Doerks</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
<article-title>eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>40</volume>
<issue>Database issue</issue>
<fpage>284</fpage>
<lpage>289</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.hubmed.org/display.cgi?uids=22096231">http://www.hubmed.org/display.cgi?uids=22096231</ext-link>
]</comment>
<pub-id pub-id-type="pmid">21896618</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
<article-title>Effective protein sequence comparison</article-title>
<source>Methods Enzymol</source>
<year>1996</year>
<volume>266</volume>
<fpage>227</fpage>
<lpage>258</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.hubmed.org/display.cgi?uids=8743688">http://www.hubmed.org/display.cgi?uids=8743688</ext-link>
]</comment>
<pub-id pub-id-type="pmid">8743688</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Rattei</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tischler</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Götz</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jehl</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Hoser</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Arnold</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Conesa</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mewes</surname>
<given-names>HW</given-names>
</name>
<article-title>SIMAP–a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
<volume>38</volume>
<issue>Database issue</issue>
<fpage>D223</fpage>
<lpage>226</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/gkp949">http://dx.doi.org/10.1093/nar/gkp949</ext-link>
]</comment>
<pub-id pub-id-type="pmid">19906725</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<article-title>Tolerating some redundancy significantly speeds up clustering of large protein databases</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>77</fpage>
<lpage>82</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://view.ncbi.nlm.nih.gov/pubmed/11836214">http://view.ncbi.nlm.nih.gov/pubmed/11836214</ext-link>
]</comment>
<pub-id pub-id-type="pmid">11836214</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<article-title>CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>13</issue>
<fpage>1658</fpage>
<lpage>1659</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btl158">http://dx.doi.org/10.1093/bioinformatics/btl158</ext-link>
]</comment>
<pub-id pub-id-type="pmid">16731699</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Fu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Niu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<article-title>CD-HIT: accelerated for clustering the next-generation sequencing data</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>23</issue>
<fpage>3150</fpage>
<lpage>3152</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics28.html#FuNZWL12">http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics28.html#FuNZWL12</ext-link>
]</comment>
<pub-id pub-id-type="pmid">23060610</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<article-title>Search and clustering orders of magnitude faster than BLAST</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>19</issue>
<fpage>2460</fpage>
<lpage>2461</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btq461">http://dx.doi.org/10.1093/bioinformatics/btq461</ext-link>
]</comment>
<pub-id pub-id-type="pmid">20709691</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Hobohm</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Scharf</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sander</surname>
<given-names>C</given-names>
</name>
<article-title>Selection of representative protein data sets</article-title>
<source>Protein Sci</source>
<year>1992</year>
<volume>1</volume>
<issue>3</issue>
<fpage>409</fpage>
<lpage>417</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.hubmed.org/display.cgi?uids=1304348">http://www.hubmed.org/display.cgi?uids=1304348</ext-link>
]</comment>
<pub-id pub-id-type="pmid">1304348</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tromp</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<article-title>PatternHunter: faster and more sensitive homology search</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>3</issue>
<fpage>440</fpage>
<lpage>445</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/18.3.440">http://dx.doi.org/10.1093/bioinformatics/18.3.440</ext-link>
]</comment>
<pub-id pub-id-type="pmid">11934743</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="book">
<name>
<surname>Mayer</surname>
<given-names>CE</given-names>
</name>
<source>Fast method for sequence comparison and application to database clustering</source>
<year>2007</year>
<publisher-name>Tuebingen, Univ.: Master thesis</publisher-name>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Przybylski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
<article-title>Consensus sequences improve PSI-BLAST through mimicking profile-profile alignments</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>7</issue>
<fpage>2238</fpage>
<lpage>2246</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/gkm107">http://dx.doi.org/10.1093/nar/gkm107</ext-link>
]</comment>
<pub-id pub-id-type="pmid">17369271</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Mistry</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Coggill</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pollington</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Gavin</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>Gunasekaran</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ceric</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Forslund</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Holm</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
<name>
<surname>Eddy</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Bateman</surname>
<given-names>A</given-names>
</name>
<article-title>The Pfam protein families database</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
<volume>38</volume>
<issue>Database issue</issue>
<fpage>D211</fpage>
<lpage>D222</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/gkp985">http://dx.doi.org/10.1093/nar/gkp985</ext-link>
]</comment>
<pub-id pub-id-type="pmid">19920124</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Lo Conte</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ailey</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hubbard</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Brenner</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Murzin</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Chothia</surname>
<given-names>C</given-names>
</name>
<article-title>SCOP: a structural classification of proteins database</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<fpage>257</fpage>
<lpage>259</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://view.ncbi.nlm.nih.gov/pubmed/10592240">http://view.ncbi.nlm.nih.gov/pubmed/10592240</ext-link>
]</comment>
<pub-id pub-id-type="pmid">10592240</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Barker</surname>
<given-names>WC</given-names>
</name>
<name>
<surname>Boeckmann</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ferro</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gasteiger</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Lopez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Magrane</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Natale</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>O’Donovan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Redaschi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Yeh</surname>
<given-names>LS</given-names>
</name>
<article-title>UniProt: the Universal Protein knowledgebase</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>Database issue</issue>
<fpage>D115</fpage>
<lpage>D119</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/nar/gkh131">http://dx.doi.org/10.1093/nar/gkh131</ext-link>
]</comment>
<pub-id pub-id-type="pmid">14681372</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Hegyi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M</given-names>
</name>
<article-title>Annotation transfer for genomics: measuring functional divergence in multi-domain proteins</article-title>
<source>Genome Res</source>
<year>2001</year>
<volume>11</volume>
<issue>10</issue>
<fpage>1632</fpage>
<lpage>1640</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1101/gr.183801">http://dx.doi.org/10.1101/gr.183801</ext-link>
]</comment>
<pub-id pub-id-type="pmid">11591640</pub-id>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="journal">
<name>
<surname>Bao</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kaloshian</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Girke</surname>
<given-names>T</given-names>
</name>
<article-title>SEED: efficient clustering of next-generation sequences</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>18</issue>
<fpage>2502</fpage>
<lpage>2509</lpage>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/content/27/18/2502.abstract">http://bioinformatics.oxfordjournals.org/content/27/18/2502.abstract</ext-link>
]</comment>
<pub-id pub-id-type="pmid">21810899</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000934 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000934 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3843501
   |texte=   kClust: fast and sensitive clustering of large protein sequence databases
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23945046" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021