Prediction of cis-regulatory elements: from high-information content analysis to motif identification.
Internal identifier: 002162 (PubMed/Corpus); previous: 002161; next: 002163
Authors: Guojun Li; Jizhu Lu; Victor Olman; Ying Xu
Source:
- Journal of bioinformatics and computational biology [ 0219-7200 ] ; 2007.
English descriptors
- KwdEn :
- Algorithms, Binding Sites (genetics), Cluster Analysis, Computer Simulation, Consensus Sequence, Models, Genetic, Models, Statistical, Pattern Recognition, Automated (methods), Predictive Value of Tests, Promoter Regions, Genetic, Regulatory Sequences, Nucleic Acid, Sequence Alignment, Transcription Factors (genetics).
- MESH :
- chemical , genetics : Transcription Factors.
- genetics : Binding Sites.
- methods : Pattern Recognition, Automated.
- Algorithms, Cluster Analysis, Computer Simulation, Consensus Sequence, Models, Genetic, Models, Statistical, Predictive Value of Tests, Promoter Regions, Genetic, Regulatory Sequences, Nucleic Acid, Sequence Alignment.
Abstract
One popular approach to predicting the binding motifs of transcription factors is to model the problem as a search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each differing from that ancestor by mutations in at most k positions, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem of finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor but capture all the desired motifs, which will lead to an efficient way to identify the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction performance with that of two other prediction programs. The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences.
Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results when compared to popular programs for identification of cis-regulatory binding motifs. A limitation of the algorithm is that it does not work well when the set of provided promoter sequences is too small or when the desired motifs appear in only a small portion of the given sequences.
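The two basic notions the abstract relies on, the Hamming k-neighborhood of an ancestor and the information content of a gapless alignment of l-mers, can be illustrated with a minimal Python sketch. This is not PROMOCO and does not implement the paper's new distance or k-pseudo-neighborhood, which the abstract does not define. The function names are hypothetical, and the information content formula used here (per-column relative entropy against a uniform background of 0.25 per base) is an assumption; the paper may use a different definition.

```python
import math
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def k_neighborhood(ancestor: str, promoter: str, k: int) -> list[str]:
    """All l-mers of `promoter` within Hamming distance k of `ancestor`.

    This is the 'trivial' case from the abstract: the ancestor is known.
    """
    l = len(ancestor)
    windows = (promoter[i:i + l] for i in range(len(promoter) - l + 1))
    return [w for w in windows if hamming(w, ancestor) <= k]


def information_content(lmers: list[str], background: float = 0.25) -> float:
    """Information content (bits) of l-mers aligned without gaps.

    Assumed definition: sum over columns of sum_b f_b * log2(f_b / p_b),
    with a uniform background p_b = 0.25. Bases with f_b = 0 contribute 0.
    """
    n = len(lmers)
    total = 0.0
    for column in zip(*lmers):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / background)
    return total


if __name__ == "__main__":
    print(k_neighborhood("ACGT", "TTACGATT", 1))
    print(information_content(["ACGT", "ACGA"]))
```

With a known ancestor, collecting candidate sites is a single linear scan per promoter, which is why the abstract calls that case trivial; the difficulty lies in not knowing the ancestor.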
DOI: 10.1142/s021972000700293x
PubMed: 17787058
Links to Exploration step
pubmed:17787058
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</title>
<author><name sortKey="Li, Guojun" sort="Li, Guojun" uniqKey="Li G" first="Guojun" last="Li">Guojun Li</name>
<affiliation><nlm:affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</nlm:affiliation>
</affiliation>
</author>
<author><name sortKey="Lu, Jizhu" sort="Lu, Jizhu" uniqKey="Lu J" first="Jizhu" last="Lu">Jizhu Lu</name>
</author>
<author><name sortKey="Olman, Victor" sort="Olman, Victor" uniqKey="Olman V" first="Victor" last="Olman">Victor Olman</name>
</author>
<author><name sortKey="Xu, Ying" sort="Xu, Ying" uniqKey="Xu Y" first="Ying" last="Xu">Ying Xu</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2007">2007</date>
<idno type="RBID">pubmed:17787058</idno>
<idno type="pmid">17787058</idno>
<idno type="doi">10.1142/s021972000700293x</idno>
<idno type="wicri:Area/PubMed/Corpus">002162</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">002162</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</title>
<author><name sortKey="Li, Guojun" sort="Li, Guojun" uniqKey="Li G" first="Guojun" last="Li">Guojun Li</name>
<affiliation><nlm:affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</nlm:affiliation>
</affiliation>
</author>
<author><name sortKey="Lu, Jizhu" sort="Lu, Jizhu" uniqKey="Lu J" first="Jizhu" last="Lu">Jizhu Lu</name>
</author>
<author><name sortKey="Olman, Victor" sort="Olman, Victor" uniqKey="Olman V" first="Victor" last="Olman">Victor Olman</name>
</author>
<author><name sortKey="Xu, Ying" sort="Xu, Ying" uniqKey="Xu Y" first="Ying" last="Xu">Ying Xu</name>
</author>
</analytic>
<series><title level="j">Journal of bioinformatics and computational biology</title>
<idno type="ISSN">0219-7200</idno>
<imprint><date when="2007" type="published">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Binding Sites (genetics)</term>
<term>Cluster Analysis</term>
<term>Computer Simulation</term>
<term>Consensus Sequence</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Pattern Recognition, Automated (methods)</term>
<term>Predictive Value of Tests</term>
<term>Promoter Regions, Genetic</term>
<term>Regulatory Sequences, Nucleic Acid</term>
<term>Sequence Alignment</term>
<term>Transcription Factors (genetics)</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="genetics" xml:lang="en"><term>Transcription Factors</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en"><term>Binding Sites</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Pattern Recognition, Automated</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Cluster Analysis</term>
<term>Computer Simulation</term>
<term>Consensus Sequence</term>
<term>Models, Genetic</term>
<term>Models, Statistical</term>
<term>Predictive Value of Tests</term>
<term>Promoter Regions, Genetic</term>
<term>Regulatory Sequences, Nucleic Acid</term>
<term>Sequence Alignment</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">One popular approach to prediction of binding motifs of transcription factors is to model the problem as to search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each of which has mutations in at most k-positions from the common ancestor, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem for finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor, but capture all the desired motifs, which will lead to an efficient way for identification of the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction results performance with two other prediction programs. 
The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis regulatory binding motifs. A limitation of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or when desired motifs appear in only small portion of the given sequences.</div>
</front>
</TEI>
<pubmed><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">17787058</PMID>
<DateCompleted><Year>2007</Year>
<Month>11</Month>
<Day>13</Day>
</DateCompleted>
<DateRevised><Year>2019</Year>
<Month>11</Month>
<Day>10</Day>
</DateRevised>
<Article PubModel="Print"><Journal><ISSN IssnType="Print">0219-7200</ISSN>
<JournalIssue CitedMedium="Print"><Volume>5</Volume>
<Issue>4</Issue>
<PubDate><Year>2007</Year>
<Month>Aug</Month>
</PubDate>
</JournalIssue>
<Title>Journal of bioinformatics and computational biology</Title>
<ISOAbbreviation>J Bioinform Comput Biol</ISOAbbreviation>
</Journal>
<ArticleTitle>Prediction of cis-regulatory elements: from high-information content analysis to motif identification.</ArticleTitle>
<Pagination><MedlinePgn>817-38</MedlinePgn>
</Pagination>
<Abstract><AbstractText>One popular approach to prediction of binding motifs of transcription factors is to model the problem as to search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each of which has mutations in at most k-positions from the common ancestor, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem for finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor, but capture all the desired motifs, which will lead to an efficient way for identification of the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction results performance with two other prediction programs. 
The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis regulatory binding motifs. A limitation of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or when desired motifs appear in only small portion of the given sequences.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y"><Author ValidYN="Y"><LastName>Li</LastName>
<ForeName>Guojun</ForeName>
<Initials>G</Initials>
<AffiliationInfo><Affiliation>School of Mathematics and System Sciences, Shandong University, Jinan 250100, China. guojun@csbl.bmb.uga.edu</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Lu</LastName>
<ForeName>Jizhu</ForeName>
<Initials>J</Initials>
</Author>
<Author ValidYN="Y"><LastName>Olman</LastName>
<ForeName>Victor</ForeName>
<Initials>V</Initials>
</Author>
<Author ValidYN="Y"><LastName>Xu</LastName>
<ForeName>Ying</ForeName>
<Initials>Y</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList><PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
<PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo><Country>Singapore</Country>
<MedlineTA>J Bioinform Comput Biol</MedlineTA>
<NlmUniqueID>101187344</NlmUniqueID>
<ISSNLinking>0219-7200</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList><Chemical><RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D014157">Transcription Factors</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList><MeshHeading><DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D001665" MajorTopicYN="N">Binding Sites</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D016000" MajorTopicYN="N">Cluster Analysis</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D003198" MajorTopicYN="N">Computer Simulation</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D016384" MajorTopicYN="N">Consensus Sequence</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D008957" MajorTopicYN="Y">Models, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D015233" MajorTopicYN="N">Models, Statistical</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D010363" MajorTopicYN="N">Pattern Recognition, Automated</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="N">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D011237" MajorTopicYN="N">Predictive Value of Tests</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D011401" MajorTopicYN="N">Promoter Regions, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D012045" MajorTopicYN="Y">Regulatory Sequences, Nucleic Acid</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D014157" MajorTopicYN="Y">Transcription Factors</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData><History><PubMedPubDate PubStatus="received"><Year>2006</Year>
<Month>09</Month>
<Day>11</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="revised"><Year>2007</Year>
<Month>03</Month>
<Day>22</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted"><Year>2007</Year>
<Month>03</Month>
<Day>22</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed"><Year>2007</Year>
<Month>9</Month>
<Day>6</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline"><Year>2007</Year>
<Month>11</Month>
<Day>14</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez"><Year>2007</Year>
<Month>9</Month>
<Day>6</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList><ArticleId IdType="pubmed">17787058</ArticleId>
<ArticleId IdType="pii">S021972000700293X</ArticleId>
<ArticleId IdType="doi">10.1142/s021972000700293x</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002162 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 002162 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Sante |area= MersV1 |flux= PubMed |étape= Corpus |type= RBID |clé= pubmed:17787058 |texte= Prediction of cis-regulatory elements: from high-information content analysis to motif identification. }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i -Sk "pubmed:17787058" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd \
       | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33.