MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

kClust: fast and sensitive clustering of large protein sequence databases.

Internal identifier: 001A23 (PubMed/Checkpoint); previous: 001A22; next: 001A24

Authors: Maria Hauser [Germany]; Christian E. Mayer; Johannes Söding

Source: BMC Bioinformatics [1471-2105]; 2013.

RBID : pubmed:23945046

French descriptors

English descriptors

Abstract

Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.
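
The scaling argument in the abstract can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a plain all-against-all comparison, and the starting size `n0` is a hypothetical figure, not one taken from the paper. It shows that each doubling of the database roughly quadruples the number of sequence pairs.

```python
def all_vs_all_pairs(n: int) -> int:
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


# Hypothetical starting size; only the growth ratio matters here.
n0 = 1_000_000

for years in (0, 2, 4, 6):
    n = n0 * 2 ** (years // 2)  # database size doubles every two years
    print(f"year +{years}: {n} sequences, {all_vs_all_pairs(n)} pairs")
```

Under these assumptions the pair count grows by a factor of about four every two years, which is why a method that avoids comparing every pair becomes necessary.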

DOI: 10.1186/1471-2105-14-248
PubMed: 23945046


Affiliations:


Links to previous steps (curation, corpus...)


Links to Exploration step

pubmed:23945046

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">kClust: fast and sensitive clustering of large protein sequence databases.</title>
<author>
<name sortKey="Hauser, Maria" sort="Hauser, Maria" uniqKey="Hauser M" first="Maria" last="Hauser">Maria Hauser</name>
<affiliation wicri:level="4">
<nlm:affiliation>Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377, Germany. soeding@genzentrum.lmu.de.</nlm:affiliation>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377</wicri:regionArea>
<placeName>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
<settlement type="city">Munich</settlement>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author>
<name sortKey="Mayer, Christian E" sort="Mayer, Christian E" uniqKey="Mayer C" first="Christian E" last="Mayer">Christian E. Mayer</name>
</author>
<author>
<name sortKey="Soding, Johannes" sort="Soding, Johannes" uniqKey="Soding J" first="Johannes" last="Söding">Johannes Söding</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23945046</idno>
<idno type="pmid">23945046</idno>
<idno type="doi">10.1186/1471-2105-14-248</idno>
<idno type="wicri:Area/PubMed/Corpus">001C19</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001C19</idno>
<idno type="wicri:Area/PubMed/Curation">001C19</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001C19</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001A23</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001A23</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">kClust: fast and sensitive clustering of large protein sequence databases.</title>
<author>
<name sortKey="Hauser, Maria" sort="Hauser, Maria" uniqKey="Hauser M" first="Maria" last="Hauser">Maria Hauser</name>
<affiliation wicri:level="4">
<nlm:affiliation>Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377, Germany. soeding@genzentrum.lmu.de.</nlm:affiliation>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377</wicri:regionArea>
<placeName>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
<settlement type="city">Munich</settlement>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author>
<name sortKey="Mayer, Christian E" sort="Mayer, Christian E" uniqKey="Mayer C" first="Christian E" last="Mayer">Christian E. Mayer</name>
</author>
<author>
<name sortKey="Soding, Johannes" sort="Soding, Johannes" uniqKey="Soding J" first="Johannes" last="Söding">Johannes Söding</name>
</author>
</analytic>
<series>
<title level="j">BMC bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Cluster Analysis</term>
<term>Databases, Factual</term>
<term>Databases, Protein</term>
<term>Sequence Analysis, Protein (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de regroupements</term>
<term>Analyse de séquence de protéine (méthodes)</term>
<term>Bases de données de protéines</term>
<term>Bases de données factuelles</term>
<term>Logiciel</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Sequence Analysis, Protein</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Cluster Analysis</term>
<term>Databases, Factual</term>
<term>Databases, Protein</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de regroupements</term>
<term>Analyse de séquence de protéine</term>
<term>Bases de données de protéines</term>
<term>Bases de données factuelles</term>
<term>Logiciel</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">23945046</PMID>
<DateCompleted>
<Year>2014</Year>
<Month>06</Month>
<Day>13</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>02</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1471-2105</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>14</Volume>
<PubDate>
<Year>2013</Year>
<Month>Aug</Month>
<Day>15</Day>
</PubDate>
</JournalIssue>
<Title>BMC bioinformatics</Title>
<ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>kClust: fast and sensitive clustering of large protein sequence databases.</ArticleTitle>
<Pagination>
<MedlinePgn>248</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1186/1471-2105-14-248</ELocationID>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%-30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Hauser</LastName>
<ForeName>Maria</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377, Germany. soeding@genzentrum.lmu.de.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Mayer</LastName>
<ForeName>Christian E</ForeName>
<Initials>CE</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Söding</LastName>
<ForeName>Johannes</ForeName>
<Initials>J</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2013</Year>
<Month>08</Month>
<Day>15</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>BMC Bioinformatics</MedlineTA>
<NlmUniqueID>100965194</NlmUniqueID>
<ISSNLinking>1471-2105</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016000" MajorTopicYN="Y">Cluster Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016208" MajorTopicYN="N">Databases, Factual</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D030562" MajorTopicYN="Y">Databases, Protein</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D020539" MajorTopicYN="N">Sequence Analysis, Protein</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2013</Year>
<Month>04</Month>
<Day>24</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2013</Year>
<Month>08</Month>
<Day>12</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2013</Year>
<Month>8</Month>
<Day>16</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2013</Year>
<Month>8</Month>
<Day>16</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2014</Year>
<Month>6</Month>
<Day>15</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">23945046</ArticleId>
<ArticleId IdType="pii">1471-2105-14-248</ArticleId>
<ArticleId IdType="doi">10.1186/1471-2105-14-248</ArticleId>
<ArticleId IdType="pmc">PMC3843501</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Proteins. 1999 Nov 15;37(3):360-78</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10591097</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2012 Dec 1;28(23):3150-2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23060610</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2000 Jan 1;28(1):270-2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10592244</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2000 May;16(5):458-64</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10871268</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2001 Oct;11(10):1632-40</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11591640</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 2001 Dec 14;314(5):1041-52</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11743721</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2002 Jan;18(1):77-82</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11836214</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2002 Apr 1;30(7):1575-84</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11917018</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2002 Mar;18(3):440-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11934743</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Protein Eng. 2002 Aug;15(8):643-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12364578</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2003 Sep;13(9):2178-89</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12952885</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2003 Sep 11;4:41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12969510</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2004 Jan 1;32(Database issue):D115-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">14681372</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">3162770</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 1990 Oct 5;215(3):403-10</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">2231712</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Protein Sci. 1992 Mar;1(3):409-17</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">1304348</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Methods Enzymol. 1996;266:227-58</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">8743688</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 1997 Sep 1;25(17):3389-402</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9254694</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2006 Jul 1;22(13):1658-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16731699</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2006 Jul 15;22(14):e9-15</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">16873526</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2007;35(7):2238-46</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17369271</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2007 May 15;23(10):1282-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17379688</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS Biol. 2007 Mar;5(3):e77</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">17355176</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2010 Jan;38(Database issue):D223-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19906725</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2010 Jan;38(Database issue):D211-22</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19920124</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2010 Mar 4;464(7285):59-65</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20203603</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Science. 2010 May 21;328(5981):994-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20489017</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2010 Oct 1;26(19):2460-1</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20709691</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2010 Nov 1;26(21):2664-71</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20843957</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2010;11 Suppl 7:S6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21106128</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2011;12:116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21513511</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Sep 15;27(18):2502-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21810899</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Jan;40(Database issue):D284-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22096231</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Jan;40(Database issue):D313-20</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22121228</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Methods. 2012 Feb;9(2):173-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22198341</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2000 Jan 1;28(1):257-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10592240</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
<affiliations>
<list>
<country>
<li>Allemagne</li>
</country>
<region>
<li>Bavière</li>
<li>District de Haute-Bavière</li>
</region>
<settlement>
<li>Munich</li>
</settlement>
<orgName>
<li>Université Louis-et-Maximilien de Munich</li>
</orgName>
</list>
<tree>
<noCountry>
<name sortKey="Mayer, Christian E" sort="Mayer, Christian E" uniqKey="Mayer C" first="Christian E" last="Mayer">Christian E. Mayer</name>
<name sortKey="Soding, Johannes" sort="Soding, Johannes" uniqKey="Soding J" first="Johannes" last="Söding">Johannes Söding</name>
</noCountry>
<country name="Allemagne">
<region name="Bavière">
<name sortKey="Hauser, Maria" sort="Hauser, Maria" uniqKey="Hauser M" first="Maria" last="Hauser">Maria Hauser</name>
</region>
</country>
</tree>
</affiliations>
</record>
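
The RESULTS abstract in the record above mentions an alignment-free prefilter based on similar 6-mers shared between pairs of sequences. The following is a deliberately simplified sketch of that general idea, not kClust's actual implementation: it counts exact shared k-mers through an inverted index, whereas kClust scores similar k-mers cumulatively. The function names, the toy sequences, and the threshold of 3 are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs: dict[str, str], k: int = 6) -> dict[tuple[str, str], int]:
    """Count distinct exact k-mers shared by each pair of sequences.

    An inverted index (k-mer -> sequence names) means only pairs that
    share at least one k-mer are ever visited.
    """
    index = defaultdict(set)
    for name, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(name)

    counts = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy sequences (hypothetical): seqA and seqB differ at one residue.
toy = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQ",
    "seqB": "MKTAYIAKQRQISFVKAHFSRQ",
    "seqC": "GSHMLEDPVDAFWWNNQRTTLP",
}

# Hypothetical threshold: keep pairs sharing at least 3 k-mers.
candidates = {pair for pair, n in shared_kmer_counts(toy).items() if n >= 3}
print(candidates)
```

Only the candidate pairs that pass such a filter would then need a more expensive alignment step, which is where the speed-up over an all-against-all comparison comes from.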

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001A23 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd -nk 001A23 | SxmlIndent | more
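
The commands above print the record as XML. As a minimal sketch of reading fields from such output in Python, the example below works on a trimmed excerpt of the record rather than on real Dilib output. It assumes the record has already been captured in a string, and it parses only the `<pubmed>` element, because the `wicri:` prefix appears without a namespace declaration in the record as displayed here, so a strict parser would likely reject the whole record. The helper name `pubmed_subtree` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record shown on this page (not the full record).
RECORD = """<record>
<TEI>
<affiliation wicri:level="4">
<country xml:lang="fr">Allemagne</country>
</affiliation>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">23945046</PMID>
<Article PubModel="Electronic">
<ArticleTitle>kClust: fast and sensitive clustering of large protein sequence databases.</ArticleTitle>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D016000" MajorTopicYN="Y">Cluster Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="pubmed">23945046</ArticleId>
<ArticleId IdType="doi">10.1186/1471-2105-14-248</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>"""


def pubmed_subtree(record_text: str) -> ET.Element:
    """Cut out the <pubmed> element as text and parse only that part."""
    start = record_text.index("<pubmed>")
    end = record_text.index("</pubmed>") + len("</pubmed>")
    return ET.fromstring(record_text[start:end])


root = pubmed_subtree(RECORD)
pmid = root.findtext("MedlineCitation/PMID")
title = root.findtext("MedlineCitation/Article/ArticleTitle")
doi = root.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
mesh = [d.text for d in root.iter("DescriptorName")]

print(pmid, doi)
print(title)
print(mesh)
```

Slicing out the subtree as text is a blunt workaround; declaring the missing namespaces before parsing would be the cleaner route if the whole record is needed.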

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Checkpoint
   |type=    RBID
   |clé=     pubmed:23945046
   |texte=   kClust: fast and sensitive clustering of large protein sequence databases.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/RBID.i   -Sk "pubmed:23945046" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021