SARS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.

Internal identifier: 002552 (PubMed/Curation); previous: 002551; next: 002553

Authors: Tiee-Jian Wu [Taiwan]; Ying-Hsueh Huang; Lung-An Li

Source: Bioinformatics (Oxford, England), ISSN 1367-4803, 2005

RBID: pubmed:16144805

Abstract

Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.
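
The word-based idea behind SK-LD can be illustrated with a minimal sketch. It assumes one common form of the symmetrised Kullback-Leibler divergence, the sum over all k-words of (p - q) * log(p / q), applied to overlapping word frequencies. The function names, the default word size `k=2`, and the pseudocount used to avoid zero frequencies are illustrative assumptions; the abstract does not give the paper's exact normalisation or zero-count handling, and the authors' own implementation is in MATLAB.

```python
from collections import Counter
from itertools import product
from math import log


def word_frequencies(seq, k, pseudocount=1.0):
    """Smoothed relative frequencies of all 4**k overlapping k-words in seq.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency strictly positive so the logarithm below is defined.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    # Words containing letters other than A, C, G, T are ignored.
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k=2, pseudocount=1.0):
    """Symmetrised KL divergence KL(p||q) + KL(q||p) between k-word profiles."""
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    # KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)


if __name__ == "__main__":
    a = "ACGTACGTACGTACGT"
    b = "AAAACCCCGGGGTTTT"
    print(symmetric_kl(a, a))  # identical sequences
    print(symmetric_kl(a, b))  # different word composition
```

In this sketch the value is zero for identical word profiles and grows as the profiles diverge. Choosing `k` relative to the sequence (window) length is the question the paper addresses; the default here is arbitrary.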

DOI: 10.1093/bioinformatics/bti658
PubMed: 16144805

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.</title>
<author>
<name sortKey="Wu, Tiee Jian" sort="Wu, Tiee Jian" uniqKey="Wu T" first="Tiee-Jian" last="Wu">Tiee-Jian Wu</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Statistics, National Cheng-Kung University, Tainan, Taiwan. tjwu@stat.ncku.edu.tw</nlm:affiliation>
<country xml:lang="fr">Taïwan</country>
<wicri:regionArea>Department of Statistics, National Cheng-Kung University, Tainan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Huang, Ying Hsueh" sort="Huang, Ying Hsueh" uniqKey="Huang Y" first="Ying-Hsueh" last="Huang">Ying-Hsueh Huang</name>
</author>
<author>
<name sortKey="Li, Lung An" sort="Li, Lung An" uniqKey="Li L" first="Lung-An" last="Li">Lung-An Li</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2005">2005</date>
<idno type="RBID">pubmed:16144805</idno>
<idno type="pmid">16144805</idno>
<idno type="doi">10.1093/bioinformatics/bti658</idno>
<idno type="wicri:Area/PubMed/Corpus">002552</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002552</idno>
<idno type="wicri:Area/PubMed/Curation">002552</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">002552</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.</title>
<author>
<name sortKey="Wu, Tiee Jian" sort="Wu, Tiee Jian" uniqKey="Wu T" first="Tiee-Jian" last="Wu">Tiee-Jian Wu</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Statistics, National Cheng-Kung University, Tainan, Taiwan. tjwu@stat.ncku.edu.tw</nlm:affiliation>
<country xml:lang="fr">Taïwan</country>
<wicri:regionArea>Department of Statistics, National Cheng-Kung University, Tainan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Huang, Ying Hsueh" sort="Huang, Ying Hsueh" uniqKey="Huang Y" first="Ying-Hsueh" last="Huang">Ying-Hsueh Huang</name>
</author>
<author>
<name sortKey="Li, Lung An" sort="Li, Lung An" uniqKey="Li L" first="Lung-An" last="Li">Lung-An Li</name>
</author>
</analytic>
<series>
<title level="j">Bioinformatics (Oxford, England)</title>
<idno type="ISSN">1367-4803</idno>
<imprint>
<date when="2005" type="published">2005</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Computer Simulation</term>
<term>Computers</term>
<term>DNA (chemistry)</term>
<term>Databases, Genetic</term>
<term>Databases, Protein</term>
<term>Escherichia coli (genetics)</term>
<term>Genes, Bacterial</term>
<term>Genome</term>
<term>Humans</term>
<term>Lipoprotein Lipase (genetics)</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Oligonucleotide Array Sequence Analysis</term>
<term>Oligonucleotide Probes (chemistry)</term>
<term>Open Reading Frames</term>
<term>Pattern Recognition, Automated</term>
<term>Phylogeny</term>
<term>SARS Virus (genetics)</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Shigella flexneri (genetics)</term>
<term>Software</term>
<term>Species Specificity</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>ADN ()</term>
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Bases de données de protéines</term>
<term>Bases de données génétiques</term>
<term>Biologie informatique ()</term>
<term>Cadres ouverts de lecture</term>
<term>Escherichia coli (génétique)</term>
<term>Gènes bactériens</term>
<term>Génome</term>
<term>Humains</term>
<term>Lipoprotein lipase (génétique)</term>
<term>Logiciel</term>
<term>Modèles génétiques</term>
<term>Modèles statistiques</term>
<term>Ordinateurs</term>
<term>Phylogénie</term>
<term>Reconnaissance automatique des formes</term>
<term>Shigella flexneri (génétique)</term>
<term>Simulation numérique</term>
<term>Sondes oligonucléotidiques ()</term>
<term>Spécificité d'espèce</term>
<term>Séquençage par oligonucléotides en batterie</term>
<term>Virus du SRAS (génétique)</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="chemistry" xml:lang="en">
<term>DNA</term>
<term>Oligonucleotide Probes</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Escherichia coli</term>
<term>Lipoprotein Lipase</term>
<term>SARS Virus</term>
<term>Shigella flexneri</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Escherichia coli</term>
<term>Lipoprotein lipase</term>
<term>Shigella flexneri</term>
<term>Virus du SRAS</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Computer Simulation</term>
<term>Computers</term>
<term>Databases, Genetic</term>
<term>Databases, Protein</term>
<term>Genes, Bacterial</term>
<term>Genome</term>
<term>Humans</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Oligonucleotide Array Sequence Analysis</term>
<term>Open Reading Frames</term>
<term>Pattern Recognition, Automated</term>
<term>Phylogeny</term>
<term>Software</term>
<term>Species Specificity</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>ADN</term>
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Bases de données de protéines</term>
<term>Bases de données génétiques</term>
<term>Biologie informatique</term>
<term>Cadres ouverts de lecture</term>
<term>Gènes bactériens</term>
<term>Génome</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Modèles génétiques</term>
<term>Modèles statistiques</term>
<term>Ordinateurs</term>
<term>Phylogénie</term>
<term>Reconnaissance automatique des formes</term>
<term>Simulation numérique</term>
<term>Sondes oligonucléotidiques</term>
<term>Spécificité d'espèce</term>
<term>Séquençage par oligonucléotides en batterie</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">16144805</PMID>
<DateCompleted>
<Year>2006</Year>
<Month>01</Month>
<Day>25</Day>
</DateCompleted>
<DateRevised>
<Year>2005</Year>
<Month>11</Month>
<Day>04</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Print">1367-4803</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>21</Volume>
<Issue>22</Issue>
<PubDate>
<Year>2005</Year>
<Month>Nov</Month>
<Day>15</Day>
</PubDate>
</JournalIssue>
<Title>Bioinformatics (Oxford, England)</Title>
<ISOAbbreviation>Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.</ArticleTitle>
<Pagination>
<MedlinePgn>4125-32</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText Label="MOTIVATION" NlmCategory="BACKGROUND">Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">Our study shows (1) for whole sequence similarity/dissimilarity identification the window size taken should be as large as possible, but probably not >3000, as restricted by CPU time in practice, (2) for each measure the optimal word size increases with window size, (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis, (4) the estimate of beta based on SK-LD can be used to filter out quickly a large number of dissimilar sequences and speed alignment-based database search for similar sequences and (5) this estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.</AbstractText>
<AbstractText Label="AVAILABILITY" NlmCategory="BACKGROUND">The algorithm SK-LD, estimate beta and simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Wu</LastName>
<ForeName>Tiee-Jian</ForeName>
<Initials>TJ</Initials>
<AffiliationInfo>
<Affiliation>Department of Statistics, National Cheng-Kung University, Tainan, Taiwan. tjwu@stat.ncku.edu.tw</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Huang</LastName>
<ForeName>Ying-Hsueh</ForeName>
<Initials>YH</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Li</LastName>
<ForeName>Lung-An</ForeName>
<Initials>LA</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2005</Year>
<Month>09</Month>
<Day>06</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Bioinformatics</MedlineTA>
<NlmUniqueID>9808944</NlmUniqueID>
<ISSNLinking>1367-4803</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D015345">Oligonucleotide Probes</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>9007-49-2</RegistryNumber>
<NameOfSubstance UI="D004247">DNA</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>EC 3.1.1.34</RegistryNumber>
<NameOfSubstance UI="D008071">Lipoprotein Lipase</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003198" MajorTopicYN="N">Computer Simulation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003201" MajorTopicYN="N">Computers</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004247" MajorTopicYN="N">DNA</DescriptorName>
<QualifierName UI="Q000737" MajorTopicYN="N">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D030541" MajorTopicYN="N">Databases, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D030562" MajorTopicYN="N">Databases, Protein</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004926" MajorTopicYN="N">Escherichia coli</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005798" MajorTopicYN="N">Genes, Bacterial</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016678" MajorTopicYN="N">Genome</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008071" MajorTopicYN="N">Lipoprotein Lipase</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008957" MajorTopicYN="N">Models, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015233" MajorTopicYN="N">Models, Statistical</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D020411" MajorTopicYN="N">Oligonucleotide Array Sequence Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015345" MajorTopicYN="N">Oligonucleotide Probes</DescriptorName>
<QualifierName UI="Q000737" MajorTopicYN="N">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016366" MajorTopicYN="N">Open Reading Frames</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010363" MajorTopicYN="N">Pattern Recognition, Automated</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010802" MajorTopicYN="N">Phylogeny</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D045473" MajorTopicYN="N">SARS Virus</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012763" MajorTopicYN="N">Shigella flexneri</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D013045" MajorTopicYN="N">Species Specificity</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2005</Year>
<Month>9</Month>
<Day>8</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2006</Year>
<Month>1</Month>
<Day>26</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2005</Year>
<Month>9</Month>
<Day>8</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">16144805</ArticleId>
<ArticleId IdType="pii">bti658</ArticleId>
<ArticleId IdType="doi">10.1093/bioinformatics/bti658</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
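
As an illustration of how the record above might be read programmatically, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It parses only the `<pubmed>` subtree: the TEI part uses the `wicri:` and `nlm:` prefixes, which are not declared in this excerpt, so a strict XML parser would likely reject the full record as shown. The `SAMPLE` string is a shortened copy of the record, and the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <pubmed> subtree shown above (assumption: the
# element layout matches the full record).
SAMPLE = """<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">16144805</PMID>
<Article PubModel="Print-Electronic">
<ArticleTitle>Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Wu</LastName><ForeName>Tiee-Jian</ForeName></Author>
<Author ValidYN="Y"><LastName>Huang</LastName><ForeName>Ying-Hsueh</ForeName></Author>
<Author ValidYN="Y"><LastName>Li</LastName><ForeName>Lung-An</ForeName></Author>
</AuthorList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</pubmed>"""


def summarize(pubmed_xml):
    """Return the PMID, title, authors and MeSH terms of one <pubmed> element."""
    root = ET.fromstring(pubmed_xml)
    citation = root.find("MedlineCitation")
    authors = [
        f"{a.findtext('ForeName')} {a.findtext('LastName')}"
        for a in citation.iter("Author")
    ]
    mesh = []
    for heading in citation.iter("MeshHeading"):
        term = heading.findtext("DescriptorName")
        qualifiers = [q.text for q in heading.findall("QualifierName")]
        # Append qualifiers in parentheses, mirroring the keyword lists above.
        mesh.append(f"{term} ({', '.join(qualifiers)})" if qualifiers else term)
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "authors": authors,
        "mesh": mesh,
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value, sep=": ")
```

To process the complete record, including the TEI header, the missing namespace declarations would first have to be supplied or the prefixed elements handled by a more lenient parser.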

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/SrasV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002552 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 002552 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    SrasV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:16144805
   |texte=   Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:16144805" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a SrasV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Tue Apr 28 14:49:16 2020. Site generation: Sat Mar 27 22:06:49 2021