MERS exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A study on the application of topic models to motif finding algorithms.

Internal identifier: 000D96 (PubMed/Curation); previous: 000D95; next: 000D97

Authors: Josep Basha Gutierrez [Japan]; Kenta Nakai [Japan]

Source: BMC Bioinformatics, vol. 17, Suppl 19 (22 Dec 2016)

RBID : pubmed:28155646

Abstract

Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients.
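The abstract's first step, treating each sequence as a document whose words are its k-mers, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy sequences, and the choice of k = 4 are assumptions made for illustration, and the perplexity shown is that of a simple smoothed unigram model, used only as a stand-in for the metric. A correlated topic model (CTM) would replace the single word distribution with per-document topic mixtures.

```python
import math
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(sequences, k):
    """Treat each sequence as a document whose words are its k-mers."""
    return [Counter(kmers(s, k)) for s in sequences]


def unigram_perplexity(corpus, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model over the corpus.

    Stand-in for the metric only; no topic model is fitted here.
    """
    totals = Counter()
    for doc in corpus:
        totals.update(doc)
    vocab_size = len(totals)
    n_tokens = sum(totals.values())
    log_likelihood = 0.0
    for count in totals.values():
        p = (count + alpha) / (n_tokens + alpha * vocab_size)
        log_likelihood += count * math.log(p)
    return math.exp(-log_likelihood / n_tokens)


# Hypothetical toy sequences, not data from the paper.
sequences = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
corpus = build_corpus(sequences, k=4)
print(corpus[0])                   # k-mer counts for the first "document"
print(unigram_perplexity(corpus))  # lower means the model fits the k-mers better
```

The resulting per-document k-mer counts are the kind of document-term input a topic-model library would expect; fitting an actual CTM and iteratively reducing its perplexity is outside the scope of this sketch.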

DOI: 10.1186/s12859-016-1364-3
PubMed: 28155646


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A study on the application of topic models to motif finding algorithms.</title>
<author>
<name sortKey="Basha Gutierrez, Josep" sort="Basha Gutierrez, Josep" uniqKey="Basha Gutierrez J" first="Josep" last="Basha Gutierrez">Josep Basha Gutierrez</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.</nlm:affiliation>
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Nakai, Kenta" sort="Nakai, Kenta" uniqKey="Nakai K" first="Kenta" last="Nakai">Kenta Nakai</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan. knakai@ims.u-tokyo.ac.jp.</nlm:affiliation>
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2016">2016</date>
<idno type="RBID">pubmed:28155646</idno>
<idno type="pmid">28155646</idno>
<idno type="doi">10.1186/s12859-016-1364-3</idno>
<idno type="wicri:Area/PubMed/Corpus">000D96</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000D96</idno>
<idno type="wicri:Area/PubMed/Curation">000D96</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000D96</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A study on the application of topic models to motif finding algorithms.</title>
<author>
<name sortKey="Basha Gutierrez, Josep" sort="Basha Gutierrez, Josep" uniqKey="Basha Gutierrez J" first="Josep" last="Basha Gutierrez">Josep Basha Gutierrez</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.</nlm:affiliation>
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Nakai, Kenta" sort="Nakai, Kenta" uniqKey="Nakai K" first="Kenta" last="Nakai">Kenta Nakai</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan. knakai@ims.u-tokyo.ac.jp.</nlm:affiliation>
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2016" type="published">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Binding Sites</term>
<term>Computational Biology (methods)</term>
<term>Humans</term>
<term>Models, Theoretical</term>
<term>Monte Carlo Method</term>
<term>Nucleotide Motifs (genetics)</term>
<term>Protein Binding</term>
<term>Regulatory Sequences, Nucleic Acid (genetics)</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Transcription Factors (metabolism)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN (méthodes)</term>
<term>Biologie informatique (méthodes)</term>
<term>Facteurs de transcription (métabolisme)</term>
<term>Humains</term>
<term>Liaison aux protéines</term>
<term>Modèles théoriques</term>
<term>Motifs nucléotidiques (génétique)</term>
<term>Méthode de Monte-Carlo</term>
<term>Sites de fixation</term>
<term>Séquences d'acides nucléiques régulatrices (génétique)</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="metabolism" xml:lang="en">
<term>Transcription Factors</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Nucleotide Motifs</term>
<term>Regulatory Sequences, Nucleic Acid</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Motifs nucléotidiques</term>
<term>Séquences d'acides nucléiques régulatrices</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" qualifier="métabolisme" xml:lang="fr">
<term>Facteurs de transcription</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Binding Sites</term>
<term>Humans</term>
<term>Models, Theoretical</term>
<term>Monte Carlo Method</term>
<term>Protein Binding</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Humains</term>
<term>Liaison aux protéines</term>
<term>Modèles théoriques</term>
<term>Méthode de Monte-Carlo</term>
<term>Sites de fixation</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">28155646</PMID>
<DateCompleted>
<Year>2017</Year>
<Month>08</Month>
<Day>17</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>02</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1471-2105</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>17</Volume>
<Issue>Suppl 19</Issue>
<PubDate>
<Year>2016</Year>
<Month>Dec</Month>
<Day>22</Day>
</PubDate>
</JournalIssue>
<Title>BMC bioinformatics</Title>
<ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>A study on the application of topic models to motif finding algorithms.</ArticleTitle>
<Pagination>
<MedlinePgn>502</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1186/s12859-016-1364-3</ELocationID>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Basha Gutierrez</LastName>
<ForeName>Josep</ForeName>
<Initials>J</Initials>
<AffiliationInfo>
<Affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan.</Affiliation>
</AffiliationInfo>
<AffiliationInfo>
<Affiliation>Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Nakai</LastName>
<ForeName>Kenta</ForeName>
<Initials>K</Initials>
<AffiliationInfo>
<Affiliation>Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561, Chiba, Japan. knakai@ims.u-tokyo.ac.jp.</Affiliation>
</AffiliationInfo>
<AffiliationInfo>
<Affiliation>Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639, Tokyo, Japan. knakai@ims.u-tokyo.ac.jp.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2016</Year>
<Month>12</Month>
<Day>22</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>BMC Bioinformatics</MedlineTA>
<NlmUniqueID>100965194</NlmUniqueID>
<ISSNLinking>1471-2105</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D014157">Transcription Factors</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D001665" MajorTopicYN="N">Binding Sites</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008962" MajorTopicYN="Y">Models, Theoretical</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D009010" MajorTopicYN="N">Monte Carlo Method</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D059372" MajorTopicYN="N">Nucleotide Motifs</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011485" MajorTopicYN="N">Protein Binding</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012045" MajorTopicYN="N">Regulatory Sequences, Nucleic Acid</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014157" MajorTopicYN="N">Transcription Factors</DescriptorName>
<QualifierName UI="Q000378" MajorTopicYN="Y">metabolism</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2017</Year>
<Month>2</Month>
<Day>4</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2017</Year>
<Month>2</Month>
<Day>6</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2017</Year>
<Month>8</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">28155646</ArticleId>
<ArticleId IdType="doi">10.1186/s12859-016-1364-3</ArticleId>
<ArticleId IdType="pii">10.1186/s12859-016-1364-3</ArticleId>
<ArticleId IdType="pmc">PMC5259985</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Pac Symp Biocomput. 2000;:467-78</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10902194</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 1998 Sep 4;281(5):827-42</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">9719638</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Science. 2004 Sep 17;305(5691):1743-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15375261</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Theor Biol Med Model. 2013 Feb 14;10:11</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23409927</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Mol Biol. 2000 Mar 10;296(5):1205-14</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10698627</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2005 Apr 27;6:109</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15857505</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 1996 Jan 1;24(1):238-41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">8594589</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2002;18 Suppl 1:S354-63</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12169566</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2003 Jul 1;31(13):3586-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12824371</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 1999 Jul-Aug;15(7-8):563-77</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10487864</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Int Conf Intell Syst Mol Biol. 1995;3:21-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">7584439</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2005 Jan;23(1):137-44</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15637633</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W199-203</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15215380</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2001 Dec;17(12):1113-22</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11751219</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2000 Apr 15;28(8):1808-18</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10734201</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2004 Jan 02;32(1):189-200</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">14704356</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genomics. 1996 Jun 15;34(3):353-67</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">8786136</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Int Conf Intell Syst Mol Biol. 2000;8:269-78</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">10977088</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2005 May 15;21(10):2240-5</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15728117</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
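
As a hedged illustration of how the record above might be consumed, the sketch below reads a trimmed copy of the `<pubmed>` subtree with Python's standard `xml.etree.ElementTree`. The `summarize` function and the `PUBMED_FRAGMENT` constant are hypothetical names introduced here. Only the `<pubmed>` part is parsed because the `<TEI>` part uses the `wicri:` and `nlm:` prefixes without namespace declarations in this listing, which a namespace-aware parser would reject as unbound.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <pubmed> subtree from the record above (illustration only).
PUBMED_FRAGMENT = """
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">28155646</PMID>
<Article PubModel="Electronic">
<ArticleTitle>A study on the application of topic models to motif finding algorithms.</ArticleTitle>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="pubmed">28155646</ArticleId>
<ArticleId IdType="doi">10.1186/s12859-016-1364-3</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
"""


def summarize(pubmed_xml):
    """Extract the PMID, title, article identifiers, and MeSH headings."""
    root = ET.fromstring(pubmed_xml.strip())
    # Use the explicit path so identifiers nested under ReferenceList
    # (the cited articles) are not mistaken for this article's own IDs.
    ids = {
        article_id.get("IdType"): article_id.text
        for article_id in root.findall("./PubmedData/ArticleIdList/ArticleId")
    }
    mesh = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        qualifiers = [q.text for q in heading.findall("QualifierName")]
        mesh.append((descriptor, qualifiers))
    return {
        "pmid": root.findtext("./MedlineCitation/PMID"),
        "title": root.findtext("./MedlineCitation/Article/ArticleTitle"),
        "ids": ids,
        "mesh": mesh,
    }


summary = summarize(PUBMED_FRAGMENT)
print(summary["pmid"], summary["ids"].get("doi"))
for descriptor, qualifiers in summary["mesh"]:
    print(descriptor, qualifiers)
```

The same function should work on the full `<pubmed>` element once it has been extracted from the surrounding `<record>`.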

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D96 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 000D96 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:28155646
   |texte=   A study on the application of topic models to motif finding algorithms.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:28155646" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021