Local homology recognition and distance measures in linear time using compressed amino acid alphabets.
Internal identifier: 000258 (Ncbi/Checkpoint); previous: 000257; next: 000259
Authors: Robert C. Edgar
Source:
- Nucleic acids research [1362-4962]; 2004.
French descriptors
- KwdFr:
- MESH:
English descriptors
- KwdEn:
- MESH:
  - chemical, analysis: Amino Acids.
  - chemical, chemistry: Proteins.
  - methods: Computational Biology, Sequence Alignment.
  - Algorithms, Evolution, Molecular, Molecular Sequence Data, Phylogeny, Sequence Homology, Amino Acid, Software, Time Factors.
Abstract
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
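The k-mer ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the distance formula or the alphabets used, so the eight-group alphabet in `GROUPS`, the shared-fraction similarity, and all function names below are assumptions chosen for illustration. The sketch counts k-mers with hash tables, which takes time proportional to sequence length for a fixed k, and finds the diagonal with the most k-mer matches as a crude stand-in for local similarity discovery.

```python
from collections import Counter, defaultdict

# Hypothetical 8-letter compressed alphabet: residues in the same group are
# treated as identical. Illustrative only; not the alphabet used in the paper.
GROUPS = ["AST", "C", "DENQ", "FWY", "G", "HKR", "ILMV", "P"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Map each residue to its group label; unknown residues become 'X'."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq with a single pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k):
    """Fraction of k-mers common to a and b, relative to the shorter sequence.

    Assumed definition for this sketch; the paper's measure may differ.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(n, cb[kmer]) for kmer, n in ca.items())
    denom = min(len(a), len(b)) - k + 1
    return shared / denom if denom > 0 else 0.0


def kmer_distance(a, b, k):
    """Simple distance derived from the shared k-mer fraction."""
    return 1.0 - kmer_similarity(a, b, k)


def best_diagonal(a, b, k):
    """Return (diagonal, hits) for the diagonal i - j with the most k-mer matches.

    Close to linear when k-mers rarely repeat; highly repetitive sequences
    can make the inner loop quadratic.
    """
    positions = defaultdict(list)
    for j in range(len(b) - k + 1):
        positions[b[j:j + k]].append(j)
    hits = Counter()
    for i in range(len(a) - k + 1):
        for j in positions.get(a[i:i + k], ()):
            hits[i - j] += 1
    return hits.most_common(1)[0] if hits else None
```

With these assumptions, `kmer_similarity("MKVLAT", "MRILSS", 3)` finds no shared 3-mers, whereas comparing `compress("MKVLAT")` with `compress("MRILSS")` finds all of them shared. This is the kind of effect a compressed alphabet is meant to have on diverged sequences, although the abstract reports a benefit only for local similarity discovery and not for distance estimates.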
DOI: 10.1093/nar/gkh180
PubMed: 14729922
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PubMed, to step Corpus: 002411
- to stream PubMed, to step Curation: 002411
- to stream PubMed, to step Checkpoint: 002252
- to stream Ncbi, to step Merge: 000258
- to stream Ncbi, to step Curation: 000258
Links to Exploration step
pubmed:14729922
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Local homology recognition and distance measures in linear time using compressed amino acid alphabets.</title>
<author><name sortKey="Edgar, Robert C" sort="Edgar, Robert C" uniqKey="Edgar R" first="Robert C" last="Edgar">Robert C. Edgar</name>
<affiliation><nlm:affiliation>bob@drive5.com</nlm:affiliation>
<wicri:noCountry code="no comma">bob@drive5.com</wicri:noCountry>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2004">2004</date>
<idno type="RBID">pubmed:14729922</idno>
<idno type="pmid">14729922</idno>
<idno type="doi">10.1093/nar/gkh180</idno>
<idno type="wicri:Area/PubMed/Corpus">002411</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002411</idno>
<idno type="wicri:Area/PubMed/Curation">002411</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">002411</idno>
<idno type="wicri:Area/PubMed/Checkpoint">002252</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">002252</idno>
<idno type="wicri:Area/Ncbi/Merge">000258</idno>
<idno type="wicri:Area/Ncbi/Curation">000258</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000258</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Local homology recognition and distance measures in linear time using compressed amino acid alphabets.</title>
<author><name sortKey="Edgar, Robert C" sort="Edgar, Robert C" uniqKey="Edgar R" first="Robert C" last="Edgar">Robert C. Edgar</name>
<affiliation><nlm:affiliation>bob@drive5.com</nlm:affiliation>
<wicri:noCountry code="no comma">bob@drive5.com</wicri:noCountry>
</affiliation>
</author>
</analytic>
<series><title level="j">Nucleic acids research</title>
<idno type="eISSN">1362-4962</idno>
<imprint><date when="2004" type="published">2004</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Amino Acids (analysis)</term>
<term>Computational Biology (methods)</term>
<term>Evolution, Molecular</term>
<term>Molecular Sequence Data</term>
<term>Phylogeny</term>
<term>Proteins (chemistry)</term>
<term>Sequence Alignment (methods)</term>
<term>Sequence Homology, Amino Acid</term>
<term>Software</term>
<term>Time Factors</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Acides aminés (analyse)</term>
<term>Algorithmes</term>
<term>Alignement de séquences ()</term>
<term>Biologie informatique ()</term>
<term>Données de séquences moléculaires</term>
<term>Facteurs temps</term>
<term>Logiciel</term>
<term>Phylogénie</term>
<term>Protéines ()</term>
<term>Similitude de séquences d'acides aminés</term>
<term>Évolution moléculaire</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="analysis" xml:lang="en"><term>Amino Acids</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="chemistry" xml:lang="en"><term>Proteins</term>
</keywords>
<keywords scheme="MESH" qualifier="analyse" xml:lang="fr"><term>Acides aminés</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Computational Biology</term>
<term>Sequence Alignment</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Evolution, Molecular</term>
<term>Molecular Sequence Data</term>
<term>Phylogeny</term>
<term>Sequence Homology, Amino Acid</term>
<term>Software</term>
<term>Time Factors</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Biologie informatique</term>
<term>Données de séquences moléculaires</term>
<term>Facteurs temps</term>
<term>Logiciel</term>
<term>Phylogénie</term>
<term>Protéines</term>
<term>Similitude de séquences d'acides aminés</term>
<term>Évolution moléculaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.</div>
</front>
</TEI>
<affiliations><list></list>
<tree><noCountry><name sortKey="Edgar, Robert C" sort="Edgar, Robert C" uniqKey="Edgar R" first="Robert C" last="Edgar">Robert C. Edgar</name>
</noCountry>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Ncbi/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000258 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Checkpoint/biblio.hfd -nk 000258 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Sante |area= MersV1 |flux= Ncbi |étape= Checkpoint |type= RBID |clé= pubmed:14729922 |texte= Local homology recognition and distance measures in linear time using compressed amino acid alphabets. }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Checkpoint/RBID.i -Sk "pubmed:14729922" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Checkpoint/biblio.hfd \
       | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33.