Analysis of common k-mers for whole genome sequences using SSB-tree.

Internal identifier: 002430 (PubMed/Curation); previous: 002429; next: 002431

Authors: Jeong-Hyeon Choi [South Korea]; Hwan-Gue Cho

Source :

Genome informatics. International Conference on Genome Informatics [ 0919-9454 ] ; 2002.

RBID : pubmed:14571372

French descriptors

KwdFr :
- Analyse de séquence d'ADN (), Archéobactéries (génétique), Bactéries (génétique), Biologie informatique (), Génome, Interprétation statistique de données, Logiciel, Oligonucléotides (génétique).
MESH :
- génétique : Archéobactéries, Bactéries, Oligonucléotides.
- Analyse de séquence d'ADN, Biologie informatique, Génome, Interprétation statistique de données, Logiciel.

English descriptors

KwdEn :
- Archaea (genetics), Bacteria (genetics), Computational Biology (methods), Data Interpretation, Statistical, Genome, Oligonucleotides (genetics), Sequence Analysis, DNA (methods), Software.
MESH :
- chemical , genetics : Oligonucleotides.
- genetics : Archaea, Bacteria.
- methods : Computational Biology, Sequence Analysis, DNA.
- Data Interpretation, Statistical, Genome, Software.

Abstract

As sequenced genomes grow larger and sequencing becomes faster, tools are needed to analyze sequences at whole-genome scale. However, in-memory algorithms such as suffix trees and suffix arrays are not applicable to the analysis of whole-genome sequence sets, since the size of an individual genome ranges from several million base pairs to hundreds of billions of base pairs. To manipulate such large sequence data effectively, an indexed data structure for external memory is necessary. In this paper, we introduce a workbench called SequeX for the analysis and visualization of whole genome sequences using the SSB-tree (Static SB-tree). It consists of two parts: the analysis query subsystem and the visualization subsystem. The query subsystem supports various transactions such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists understand whole-genome structure and features through a sequence viewer, an annotation viewer, a CGR (Chaos Game Representation) viewer, and a k-mer viewer. The system also provides a user-friendly programming interface based on Java script for batch processing and for user-specific extensions. SequeX can be used to identify conserved genes or sequences through the analysis of common k-mers and annotations. We analyze the common k-mers of 72 microbial genomes made available through Entrez and find that the longest common k-mer across the 72 sequences is an 11-mer, and that only 11 such 11-mers exist. Finally, we note that many common k-mers occur in conserved regions such as CDS, rRNA, and tRNA.
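
The notion of a common k-mer used in the abstract can be illustrated with a small in-memory sketch. This is not the SSB-tree method of the paper, which targets external memory; it is a plain set intersection that only works for toy inputs, and the function names (`kmers`, `common_kmers`, `longest_common_k`) are hypothetical. It also assumes exact matches on one strand only, with no reverse-complement handling.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (empty if k > len(seq))."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_kmers(seqs, k):
    """Return the k-mers that occur in every sequence of seqs."""
    if not seqs:
        return set()
    return set.intersection(*(kmers(s, k) for s in seqs))


def longest_common_k(seqs):
    """Return the largest k for which some k-mer is shared by all sequences.

    If a common (k+1)-mer exists, its prefix is a common k-mer, so scanning
    k upward and stopping at the first empty result is sufficient.
    """
    k = 0
    while common_kmers(seqs, k + 1):
        k += 1
    return k


# Toy input, invented for illustration only (not data from the paper).
toy = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
print(longest_common_k(toy), common_kmers(toy, 4))
```

On real genomes the per-sequence k-mer sets would not fit in memory, which is the motivation the abstract gives for an external-memory index.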

PubMed: 14571372

Links toward previous steps (curation, corpus...)

to stream PubMed, to step Corpus: To go to this record in the Curation step: 002430

Links to Exploration step

pubmed:14571372

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Analysis of common k-mers for whole genome sequences using SSB-tree.</title>
<author><name sortKey="Choi, Jeong Hyeon" sort="Choi, Jeong Hyeon" uniqKey="Choi J" first="Jeong-Hyeon" last="Choi">Jeong-Hyeon Choi</name>
<affiliation wicri:level="1"><nlm:affiliation>ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea. jhchoi@pusan.ac.kr</nlm:affiliation>
<country xml:lang="fr">Corée du Sud</country>
<wicri:regionArea>ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Cho, Hwan Gue" sort="Cho, Hwan Gue" uniqKey="Cho H" first="Hwan-Gue" last="Cho">Hwan-Gue Cho</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2002">2002</date>
<idno type="RBID">pubmed:14571372</idno>
<idno type="pmid">14571372</idno>
<idno type="wicri:Area/PubMed/Corpus">002430</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002430</idno>
<idno type="wicri:Area/PubMed/Curation">002430</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">002430</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Analysis of common k-mers for whole genome sequences using SSB-tree.</title>
<author><name sortKey="Choi, Jeong Hyeon" sort="Choi, Jeong Hyeon" uniqKey="Choi J" first="Jeong-Hyeon" last="Choi">Jeong-Hyeon Choi</name>
<affiliation wicri:level="1"><nlm:affiliation>ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea. jhchoi@pusan.ac.kr</nlm:affiliation>
<country xml:lang="fr">Corée du Sud</country>
<wicri:regionArea>ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Cho, Hwan Gue" sort="Cho, Hwan Gue" uniqKey="Cho H" first="Hwan-Gue" last="Cho">Hwan-Gue Cho</name>
</author>
</analytic>
<series><title level="j">Genome informatics. International Conference on Genome Informatics</title>
<idno type="ISSN">0919-9454</idno>
<imprint><date when="2002" type="published">2002</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Archaea (genetics)</term>
<term>Bacteria (genetics)</term>
<term>Computational Biology (methods)</term>
<term>Data Interpretation, Statistical</term>
<term>Genome</term>
<term>Oligonucleotides (genetics)</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Analyse de séquence d'ADN ()</term>
<term>Archéobactéries (génétique)</term>
<term>Bactéries (génétique)</term>
<term>Biologie informatique ()</term>
<term>Génome</term>
<term>Interprétation statistique de données</term>
<term>Logiciel</term>
<term>Oligonucléotides (génétique)</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="genetics" xml:lang="en"><term>Oligonucleotides</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en"><term>Archaea</term>
<term>Bacteria</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr"><term>Archéobactéries</term>
<term>Bactéries</term>
<term>Oligonucléotides</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Data Interpretation, Statistical</term>
<term>Genome</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Génome</term>
<term>Interprétation statistique de données</term>
<term>Logiciel</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">As sequenced genomes become larger and sequencing process becomes faster, there is a need to develop a tool to analyze sequences in the whole genomic scale. However, on-memory algorithms such as suffix tree and suffix array are not applicable to the analysis of whole genome sequence set, since the size of individual whole genome ranges from several million base pairs to hundreds billion base pairs. In order to effectively manipulate the huge sequence data, it is necessary to use the indexed data structure for external memory. In this paper, we introduce a workbench called SequeX for the analysis and visualization of whole genome sequences using SSB-tree (Static SB-tree). It consists of two parts: the analysis query subsystem and the visualization subsystem. The query subsystem supports various transactions such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists to easily understand whole genome structure and feature by sequence viewer, annotation viewer, CGR (Chaos Game Representation) viewer, and k-mer viewer. The system also supports a user-friendly programming interface based on Java script for batch processing and the extension for a specific purpose of a user. SequeX can be used to identify conserved genes or sequences by the analysis of the common k-mers and annotation. We analyze the common k-mer for 72 microbial genomes announced by Entrez, and find an interesting biological fact that the longest common k-mer for 72 sequences is 11-mer, and only 11 such sequences exist. Finally we note that many common k-mers occur in conserved region such as CDS, rRNA, and tRNA.</div>
</front>
</TEI>
<pubmed><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">14571372</PMID>
<DateCompleted><Year>2003</Year>
<Month>11</Month>
<Day>26</Day>
</DateCompleted>
<DateRevised><Year>2006</Year>
<Month>08</Month>
<Day>08</Day>
</DateRevised>
<Article PubModel="Print"><Journal><ISSN IssnType="Print">0919-9454</ISSN>
<JournalIssue CitedMedium="Print"><Volume>13</Volume>
<PubDate><Year>2002</Year>
</PubDate>
</JournalIssue>
<Title>Genome informatics. International Conference on Genome Informatics</Title>
<ISOAbbreviation>Genome Inform</ISOAbbreviation>
</Journal>
<ArticleTitle>Analysis of common k-mers for whole genome sequences using SSB-tree.</ArticleTitle>
<Pagination><MedlinePgn>30-41</MedlinePgn>
</Pagination>
<Abstract><AbstractText>As sequenced genomes become larger and sequencing process becomes faster, there is a need to develop a tool to analyze sequences in the whole genomic scale. However, on-memory algorithms such as suffix tree and suffix array are not applicable to the analysis of whole genome sequence set, since the size of individual whole genome ranges from several million base pairs to hundreds billion base pairs. In order to effectively manipulate the huge sequence data, it is necessary to use the indexed data structure for external memory. In this paper, we introduce a workbench called SequeX for the analysis and visualization of whole genome sequences using SSB-tree (Static SB-tree). It consists of two parts: the analysis query subsystem and the visualization subsystem. The query subsystem supports various transactions such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists to easily understand whole genome structure and feature by sequence viewer, annotation viewer, CGR (Chaos Game Representation) viewer, and k-mer viewer. The system also supports a user-friendly programming interface based on Java script for batch processing and the extension for a specific purpose of a user. SequeX can be used to identify conserved genes or sequences by the analysis of the common k-mers and annotation. We analyze the common k-mer for 72 microbial genomes announced by Entrez, and find an interesting biological fact that the longest common k-mer for 72 sequences is 11-mer, and only 11 such sequences exist. Finally we note that many common k-mers occur in conserved region such as CDS, rRNA, and tRNA.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y"><Author ValidYN="Y"><LastName>Choi</LastName>
<ForeName>Jeong-Hyeon</ForeName>
<Initials>JH</Initials>
<AffiliationInfo><Affiliation>ALGORIGENE Bioinformatics Lab., Department of Computer Science, Pusan National University, Kum-Jung-Ku, Pusan 609-735, Korea. jhchoi@pusan.ac.kr</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Cho</LastName>
<ForeName>Hwan-Gue</ForeName>
<Initials>HG</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList><PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo><Country>Japan</Country>
<MedlineTA>Genome Inform</MedlineTA>
<NlmUniqueID>101280573</NlmUniqueID>
<ISSNLinking>0919-9454</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList><Chemical><RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D009841">Oligonucleotides</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList><MeshHeading><DescriptorName UI="D001105" MajorTopicYN="N">Archaea</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D001419" MajorTopicYN="N">Bacteria</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D003627" MajorTopicYN="Y">Data Interpretation, Statistical</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D016678" MajorTopicYN="N">Genome</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D009841" MajorTopicYN="N">Oligonucleotides</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData><History><PubMedPubDate PubStatus="pubmed"><Year>2003</Year>
<Month>10</Month>
<Day>23</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline"><Year>2003</Year>
<Month>12</Month>
<Day>3</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez"><Year>2003</Year>
<Month>10</Month>
<Day>23</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList><ArticleId IdType="pubmed">14571372</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
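
As a minimal sketch of reading the record above programmatically, the following uses Python's standard `xml.etree.ElementTree` to list MeSH descriptors with their qualifiers. Two assumptions are worth stating: the sketch parses only a `MeshHeadingList` fragment copied from the `<pubmed>` part, because the TEI part as displayed uses the `wicri:` and `nlm:` prefixes without declaring them and a strict XML parser would reject it; and the helper name `mesh_headings` is hypothetical, not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Shortened fragment copied from the <pubmed> part of the record.
FRAGMENT = """<MeshHeadingList>
<MeshHeading><DescriptorName UI="D001105" MajorTopicYN="N">Archaea</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>"""


def mesh_headings(xml_text):
    """Return (descriptor, [qualifiers]) pairs from a MeshHeadingList fragment."""
    root = ET.fromstring(xml_text)
    result = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        qualifiers = [q.text for q in heading.findall("QualifierName")]
        result.append((descriptor, qualifiers))
    return result


for descriptor, qualifiers in mesh_headings(FRAGMENT):
    print(descriptor, qualifiers)
```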

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002430 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 002430 | SxmlIndent | more

To link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:14571372
   |texte=   Analysis of common k-mers for whole genome sequences using SSB-tree.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:14571372" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
