MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Prediction of cis-regulatory elements: from high-information content analysis to motif identification.

Internal identifier: 002162 (PubMed/Curation); previous: 002161; next: 002163

Authors: Guojun Li [People's Republic of China]; Jizhu Lu; Victor Olman; Ying Xu

Source :

RBID : pubmed:17787058

French descriptors

English descriptors

Abstract

One popular approach to predicting the binding motifs of transcription factors is to model the problem as a search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each differing from that ancestor by mutations in at most k positions, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem of finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor but capture all the desired motifs, which will lead to an efficient way of identifying the desired motifs. The main contributions of this paper are fourfold: (a) we have derived nontrivial lower and upper bounds on the information content of a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction performance with that of two other prediction programs. The algorithm, implemented as the software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences.
Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results when compared with popular programs for identifying cis-regulatory binding motifs. A limitation of the algorithm is that it does not work well when the set of provided promoter sequences is too small or when the desired motifs appear in only a small portion of the given sequences.
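
The two standard notions the abstract builds on, the Hamming k-neighborhood of an ancestor and the information content of a gapless alignment, can be illustrated with a minimal Python sketch. This is not PROMOCO: the abstract does not define the paper's new distance or its k-pseudo-neighborhood, so neither is reproduced here. The function names, the uniform background frequency of 0.25 per base, and the relative-entropy form of information content are assumptions made for illustration.

```python
import math
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def k_neighborhood_hits(sequence: str, ancestor: str, k: int) -> list[str]:
    """All l-mers of `sequence` within Hamming distance k of `ancestor`.

    This is the 'trivial' case from the abstract: the ancestor is given.
    """
    l = len(ancestor)
    windows = (sequence[i:i + l] for i in range(len(sequence) - l + 1))
    return [w for w in windows if hamming(w, ancestor) <= k]


def information_content(lmers: list[str], background: float = 0.25) -> float:
    """Relative entropy (in bits) of a gapless alignment, summed over columns.

    Assumes the same background frequency for every base; this is an
    assumption of the sketch, not a detail taken from the abstract.
    """
    n = len(lmers)
    total = 0.0
    for column in zip(*lmers):  # one tuple of bases per alignment column
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / background)
    return total
```

With a hypothetical ancestor `ACGA` and `k = 1`, scanning each promoter with `k_neighborhood_hits` returns the candidate l-mers, and `information_content` then scores the resulting alignment. A perfectly conserved alignment of length l scores 2l bits under the uniform background, and every mismatch lowers the score.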

DOI: 10.1142/s021972000700293x
PubMed: 17787058

Links to Exploration step

pubmed:17787058

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</title>
<author>
<name sortKey="Li, Guojun" sort="Li, Guojun" uniqKey="Li G" first="Guojun" last="Li">Guojun Li</name>
<affiliation wicri:level="1">
<nlm:affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</nlm:affiliation>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>School of Mathematics and System Sciences, Shandong University, Jinan 250100</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Lu, Jizhu" sort="Lu, Jizhu" uniqKey="Lu J" first="Jizhu" last="Lu">Jizhu Lu</name>
</author>
<author>
<name sortKey="Olman, Victor" sort="Olman, Victor" uniqKey="Olman V" first="Victor" last="Olman">Victor Olman</name>
</author>
<author>
<name sortKey="Xu, Ying" sort="Xu, Ying" uniqKey="Xu Y" first="Ying" last="Xu">Ying Xu</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2007">2007</date>
<idno type="RBID">pubmed:17787058</idno>
<idno type="pmid">17787058</idno>
<idno type="doi">10.1142/s021972000700293x</idno>
<idno type="wicri:Area/PubMed/Corpus">002162</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002162</idno>
<idno type="wicri:Area/PubMed/Curation">002162</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">002162</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</title>
<author>
<name sortKey="Li, Guojun" sort="Li, Guojun" uniqKey="Li G" first="Guojun" last="Li">Guojun Li</name>
<affiliation wicri:level="1">
<nlm:affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</nlm:affiliation>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>School of Mathematics and System Sciences, Shandong University, Jinan 250100</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Lu, Jizhu" sort="Lu, Jizhu" uniqKey="Lu J" first="Jizhu" last="Lu">Jizhu Lu</name>
</author>
<author>
<name sortKey="Olman, Victor" sort="Olman, Victor" uniqKey="Olman V" first="Victor" last="Olman">Victor Olman</name>
</author>
<author>
<name sortKey="Xu, Ying" sort="Xu, Ying" uniqKey="Xu Y" first="Ying" last="Xu">Ying Xu</name>
</author>
</analytic>
<series>
<title level="j">Journal of bioinformatics and computational biology</title>
<idno type="ISSN">0219-7200</idno>
<imprint>
<date when="2007" type="published">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Binding Sites (genetics)</term>
<term>Cluster Analysis</term>
<term>Computer Simulation</term>
<term>Consensus Sequence</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Pattern Recognition, Automated (methods)</term>
<term>Predictive Value of Tests</term>
<term>Promoter Regions, Genetic</term>
<term>Regulatory Sequences, Nucleic Acid</term>
<term>Sequence Alignment</term>
<term>Transcription Factors (genetics)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de regroupements</term>
<term>Facteurs de transcription (génétique)</term>
<term>Modèles génétiques</term>
<term>Modèles statistiques</term>
<term>Reconnaissance automatique des formes ()</term>
<term>Régions promotrices (génétique)</term>
<term>Simulation numérique</term>
<term>Sites de fixation (génétique)</term>
<term>Séquence consensus</term>
<term>Séquences d'acides nucléiques régulatrices</term>
<term>Valeur prédictive des tests</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="genetics" xml:lang="en">
<term>Transcription Factors</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Binding Sites</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Facteurs de transcription</term>
<term>Sites de fixation</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Pattern Recognition, Automated</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Cluster Analysis</term>
<term>Computer Simulation</term>
<term>Consensus Sequence</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Predictive Value of Tests</term>
<term>Promoter Regions, Genetic</term>
<term>Regulatory Sequences, Nucleic Acid</term>
<term>Sequence Alignment</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de regroupements</term>
<term>Modèles génétiques</term>
<term>Modèles statistiques</term>
<term>Reconnaissance automatique des formes</term>
<term>Régions promotrices (génétique)</term>
<term>Simulation numérique</term>
<term>Séquence consensus</term>
<term>Séquences d'acides nucléiques régulatrices</term>
<term>Valeur prédictive des tests</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">One popular approach to prediction of binding motifs of transcription factors is to model the problem as to search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each of which has mutations in at most k-positions from the common ancestor, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem for finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor, but capture all the desired motifs, which will lead to an efficient way for identification of the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction results performance with two other prediction programs. 
The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis regulatory binding motifs. A limitation of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or when desired motifs appear in only small portion of the given sequences.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">17787058</PMID>
<DateCompleted>
<Year>2007</Year>
<Month>11</Month>
<Day>13</Day>
</DateCompleted>
<DateRevised>
<Year>2019</Year>
<Month>11</Month>
<Day>10</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0219-7200</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>5</Volume>
<Issue>4</Issue>
<PubDate>
<Year>2007</Year>
<Month>Aug</Month>
</PubDate>
</JournalIssue>
<Title>Journal of bioinformatics and computational biology</Title>
<ISOAbbreviation>J Bioinform Comput Biol</ISOAbbreviation>
</Journal>
<ArticleTitle>Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</ArticleTitle>
<Pagination>
<MedlinePgn>817-38</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>One popular approach to prediction of binding motifs of transcription factors is to model the problem as to search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each of which has mutations in at most k-positions from the common ancestor, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem for finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor, but capture all the desired motifs, which will lead to an efficient way for identification of the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction results performance with two other prediction programs. The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. 
Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis regulatory binding motifs. A limitation of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or when desired motifs appear in only small portion of the given sequences.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Li</LastName>
<ForeName>Guojun</ForeName>
<Initials>G</Initials>
<AffiliationInfo>
<Affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Lu</LastName>
<ForeName>Jizhu</ForeName>
<Initials>J</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Olman</LastName>
<ForeName>Victor</ForeName>
<Initials>V</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Xu</LastName>
<ForeName>Ying</ForeName>
<Initials>Y</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
<PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>Singapore</Country>
<MedlineTA>J Bioinform Comput Biol</MedlineTA>
<NlmUniqueID>101187344</NlmUniqueID>
<ISSNLinking>0219-7200</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D014157">Transcription Factors</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D001665" MajorTopicYN="N">Binding Sites</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016000" MajorTopicYN="N">Cluster Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003198" MajorTopicYN="N">Computer Simulation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016384" MajorTopicYN="N">Consensus Sequence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008957" MajorTopicYN="Y">Models, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015233" MajorTopicYN="N">Models, Statistical</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010363" MajorTopicYN="N">Pattern Recognition, Automated</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="N">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011237" MajorTopicYN="N">Predictive Value of Tests</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011401" MajorTopicYN="N">Promoter Regions, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012045" MajorTopicYN="Y">Regulatory Sequences, Nucleic Acid</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014157" MajorTopicYN="Y">Transcription Factors</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2006</Year>
<Month>09</Month>
<Day>11</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="revised">
<Year>2007</Year>
<Month>03</Month>
<Day>22</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2007</Year>
<Month>03</Month>
<Day>22</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2007</Year>
<Month>9</Month>
<Day>6</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2007</Year>
<Month>11</Month>
<Day>14</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2007</Year>
<Month>9</Month>
<Day>6</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">17787058</ArticleId>
<ArticleId IdType="pii">S021972000700293X</ArticleId>
<ArticleId IdType="doi">10.1142/s021972000700293x</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
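
As a hedged illustration of how the `<pubmed>` part of such a record might be read, the following Python sketch uses the standard-library `xml.etree.ElementTree` module. It works on a trimmed copy of the `<pubmed>` subtree rather than the whole record, because the `<TEI>` part as shown uses `wicri:` and `nlm:` prefixes without declaring them, which a strict XML parser would reject. The `SAMPLE` constant and the `summarize` function are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <pubmed> subtree from the record above (illustrative).
SAMPLE = """<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">17787058</PMID>
<Article PubModel="Print">
<ArticleTitle>Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Li</LastName><ForeName>Guojun</ForeName></Author>
<Author ValidYN="Y"><LastName>Lu</LastName><ForeName>Jizhu</ForeName></Author>
<Author ValidYN="Y"><LastName>Olman</LastName><ForeName>Victor</ForeName></Author>
<Author ValidYN="Y"><LastName>Xu</LastName><ForeName>Ying</ForeName></Author>
</AuthorList>
</Article>
</MedlineCitation>
</pubmed>"""


def summarize(xml_text: str) -> dict:
    """Extract the PMID, title, and author names from a <pubmed> element."""
    root = ET.fromstring(xml_text)
    citation = root.find("MedlineCitation")
    authors = [
        f"{a.findtext('ForeName')} {a.findtext('LastName')}"
        for a in citation.iterfind("Article/AuthorList/Author")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "authors": authors,
    }
```

Calling `summarize(SAMPLE)` returns a dictionary with the `pmid`, `title`, and `authors` keys. To process the full record, the undeclared prefixes would first need namespace declarations on the root element.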

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002162 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 002162 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:17787058
   |texte=   Prediction of cis-regulatory elements: from high-information content analysis to motif identification.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:17787058" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021