MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Clustering of DNA words and biological function: a proof of principle.

Internal identifier: 000917 (Ncbi/Merge); previous: 000916; next: 000918

Authors: Michael Hackenberg [Spain]; Antonio Rueda; Pedro Carpena; Pedro Bernaola-Galván; Guillermo Barturen; José L. Oliver

Source: Journal of theoretical biology (eISSN 1095-8541), 2012; 297: 127-36

RBID : pubmed:22226985

Abstract

Relevant words in literary texts (keywords) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k = 2-9 bp) in human and mouse chromosome sequences, and then check whether clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (or 72.26%) in human and 1385 (or 72.97%) in mouse of the top-clustered 8-mers showed a statistically significant association with either exons or TFBSs, thus strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus appears to be a novel principle for detecting functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome.
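
The abstract does not give the formula for the clustering coefficient, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a stand-in measure: the coefficient of variation of the distances between consecutive occurrences of a k-mer, which is large when occurrences are clumped and zero when they are evenly spaced. The function names `kmer_positions` and `gap_clustering` and the toy sequences are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def kmer_positions(seq, k):
    """Map each k-mer to the list of its start positions (overlaps included)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def gap_clustering(positions):
    """Coefficient of variation of the gaps between consecutive occurrences.

    This is an assumed stand-in for a clustering coefficient: values well
    above 1 suggest clumped occurrences, values near 0 evenly spaced ones.
    Returns None when there are fewer than three occurrences.
    """
    if len(positions) < 3:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


# Toy sequences: "ACG" appears in two tight clumps, or at regular intervals.
clumped = "ACGACGACG" + "T" * 91 + "ACGACGACG"
regular = "ACGTTTT" * 6

clumped_score = gap_clustering(kmer_positions(clumped, 3)["ACG"])
regular_score = gap_clustering(kmer_positions(regular, 3)["ACG"])
print(clumped_score, regular_score)
```

On real chromosomes the raw score would also need to be compared against what a random placement of the same number of occurrences would give, since rare and frequent words differ in their expected gap statistics; that normalization is left out here.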

DOI: 10.1016/j.jtbi.2011.12.024
PubMed: 22226985

Links to Exploration step

pubmed:22226985

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Clustering of DNA words and biological function: a proof of principle.</title>
<author>
<name sortKey="Hackenberg, Michael" sort="Hackenberg, Michael" uniqKey="Hackenberg M" first="Michael" last="Hackenberg">Michael Hackenberg</name>
<affiliation wicri:level="4">
<nlm:affiliation>Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain. mlhack@gmail.com</nlm:affiliation>
<country xml:lang="fr">Espagne</country>
<wicri:regionArea>Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada</wicri:regionArea>
<placeName>
<region nuts="2" type="communauté">Andalousie</region>
<settlement type="city">Grenade (Espagne)</settlement>
</placeName>
<orgName type="university">Université de Grenade</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rueda, Antonio" sort="Rueda, Antonio" uniqKey="Rueda A" first="Antonio" last="Rueda">Antonio Rueda</name>
</author>
<author>
<name sortKey="Carpena, Pedro" sort="Carpena, Pedro" uniqKey="Carpena P" first="Pedro" last="Carpena">Pedro Carpena</name>
</author>
<author>
<name sortKey="Bernaola Galvan, Pedro" sort="Bernaola Galvan, Pedro" uniqKey="Bernaola Galvan P" first="Pedro" last="Bernaola-Galván">Pedro Bernaola-Galván</name>
</author>
<author>
<name sortKey="Barturen, Guillermo" sort="Barturen, Guillermo" uniqKey="Barturen G" first="Guillermo" last="Barturen">Guillermo Barturen</name>
</author>
<author>
<name sortKey="Oliver, Jose L" sort="Oliver, Jose L" uniqKey="Oliver J" first="José L" last="Oliver">José L. Oliver</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2012">2012</date>
<idno type="RBID">pubmed:22226985</idno>
<idno type="pmid">22226985</idno>
<idno type="doi">10.1016/j.jtbi.2011.12.024</idno>
<idno type="wicri:Area/PubMed/Corpus">001E19</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001E19</idno>
<idno type="wicri:Area/PubMed/Curation">001E19</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001E19</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001C94</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001C94</idno>
<idno type="wicri:Area/Ncbi/Merge">000917</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Clustering of DNA words and biological function: a proof of principle.</title>
<author>
<name sortKey="Hackenberg, Michael" sort="Hackenberg, Michael" uniqKey="Hackenberg M" first="Michael" last="Hackenberg">Michael Hackenberg</name>
<affiliation wicri:level="4">
<nlm:affiliation>Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain. mlhack@gmail.com</nlm:affiliation>
<country xml:lang="fr">Espagne</country>
<wicri:regionArea>Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada</wicri:regionArea>
<placeName>
<region nuts="2" type="communauté">Andalousie</region>
<settlement type="city">Grenade (Espagne)</settlement>
</placeName>
<orgName type="university">Université de Grenade</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rueda, Antonio" sort="Rueda, Antonio" uniqKey="Rueda A" first="Antonio" last="Rueda">Antonio Rueda</name>
</author>
<author>
<name sortKey="Carpena, Pedro" sort="Carpena, Pedro" uniqKey="Carpena P" first="Pedro" last="Carpena">Pedro Carpena</name>
</author>
<author>
<name sortKey="Bernaola Galvan, Pedro" sort="Bernaola Galvan, Pedro" uniqKey="Bernaola Galvan P" first="Pedro" last="Bernaola-Galván">Pedro Bernaola-Galván</name>
</author>
<author>
<name sortKey="Barturen, Guillermo" sort="Barturen, Guillermo" uniqKey="Barturen G" first="Guillermo" last="Barturen">Guillermo Barturen</name>
</author>
<author>
<name sortKey="Oliver, Jose L" sort="Oliver, Jose L" uniqKey="Oliver J" first="José L" last="Oliver">José L. Oliver</name>
</author>
</analytic>
<series>
<title level="j">Journal of theoretical biology</title>
<idno type="eISSN">1095-8541</idno>
<imprint>
<date when="2012" type="published">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Base Sequence</term>
<term>Binding Sites (genetics)</term>
<term>Cluster Analysis</term>
<term>DNA (genetics)</term>
<term>Exons (genetics)</term>
<term>Humans</term>
<term>Introns (genetics)</term>
<term>Linguistics</term>
<term>Mice</term>
<term>Models, Genetic</term>
<term>Species Specificity</term>
<term>Transcription Factors (genetics)</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>ADN (génétique)</term>
<term>Algorithmes</term>
<term>Analyse de regroupements</term>
<term>Animaux</term>
<term>Exons (génétique)</term>
<term>Facteurs de transcription (génétique)</term>
<term>Humains</term>
<term>Introns (génétique)</term>
<term>Linguistique</term>
<term>Modèles génétiques</term>
<term>Sites de fixation (génétique)</term>
<term>Souris</term>
<term>Spécificité d'espèce</term>
<term>Séquence nucléotidique</term>
</keywords>
<keywords scheme="MESH" type="chemical" qualifier="genetics" xml:lang="en">
<term>DNA</term>
<term>Transcription Factors</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Binding Sites</term>
<term>Exons</term>
<term>Introns</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>ADN</term>
<term>Exons</term>
<term>Facteurs de transcription</term>
<term>Introns</term>
<term>Sites de fixation</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Base Sequence</term>
<term>Cluster Analysis</term>
<term>Humans</term>
<term>Linguistics</term>
<term>Mice</term>
<term>Models, Genetic</term>
<term>Species Specificity</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de regroupements</term>
<term>Animaux</term>
<term>Humains</term>
<term>Linguistique</term>
<term>Modèles génétiques</term>
<term>Souris</term>
<term>Spécificité d'espèce</term>
<term>Séquence nucléotidique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text per excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less-important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2-9bp) in human and mouse chromosome sequences, then checking if clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (or 72.26%) in human and 1385 (or 72.97%) in mouse of the top-clustered 8-mers showed a statistically significant association to either exons or TFBSs, thus strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences. 
As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and the improvement of gene and promoter predictions, thus contributing to the quest for function in the genome.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">22226985</PMID>
<DateCompleted>
<Year>2012</Year>
<Month>04</Month>
<Day>30</Day>
</DateCompleted>
<DateRevised>
<Year>2012</Year>
<Month>02</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1095-8541</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>297</Volume>
<PubDate>
<Year>2012</Year>
<Month>Mar</Month>
<Day>21</Day>
</PubDate>
</JournalIssue>
<Title>Journal of theoretical biology</Title>
<ISOAbbreviation>J. Theor. Biol.</ISOAbbreviation>
</Journal>
<ArticleTitle>Clustering of DNA words and biological function: a proof of principle.</ArticleTitle>
<Pagination>
<MedlinePgn>127-36</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1016/j.jtbi.2011.12.024</ELocationID>
<Abstract>
<AbstractText>Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text per excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less-important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2-9bp) in human and mouse chromosome sequences, then checking if clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (or 72.26%) in human and 1385 (or 72.97%) in mouse of the top-clustered 8-mers showed a statistically significant association to either exons or TFBSs, thus strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences. 
As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and the improvement of gene and promoter predictions, thus contributing to the quest for function in the genome.</AbstractText>
<CopyrightInformation>Copyright © 2011 Elsevier Ltd. All rights reserved.</CopyrightInformation>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Hackenberg</LastName>
<ForeName>Michael</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain. mlhack@gmail.com</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Rueda</LastName>
<ForeName>Antonio</ForeName>
<Initials>A</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Carpena</LastName>
<ForeName>Pedro</ForeName>
<Initials>P</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Bernaola-Galván</LastName>
<ForeName>Pedro</ForeName>
<Initials>P</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Barturen</LastName>
<ForeName>Guillermo</ForeName>
<Initials>G</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Oliver</LastName>
<ForeName>José L</ForeName>
<Initials>JL</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2011</Year>
<Month>12</Month>
<Day>30</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>J Theor Biol</MedlineTA>
<NlmUniqueID>0376342</NlmUniqueID>
<ISSNLinking>0022-5193</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D014157">Transcription Factors</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>9007-49-2</RegistryNumber>
<NameOfSubstance UI="D004247">DNA</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D001483" MajorTopicYN="N">Base Sequence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D001665" MajorTopicYN="N">Binding Sites</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016000" MajorTopicYN="N">Cluster Analysis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004247" MajorTopicYN="N">DNA</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005091" MajorTopicYN="N">Exons</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D007438" MajorTopicYN="N">Introns</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008037" MajorTopicYN="N">Linguistics</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D051379" MajorTopicYN="N">Mice</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008957" MajorTopicYN="Y">Models, Genetic</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D013045" MajorTopicYN="N">Species Specificity</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014157" MajorTopicYN="N">Transcription Factors</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2011</Year>
<Month>10</Month>
<Day>17</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="revised">
<Year>2011</Year>
<Month>12</Month>
<Day>20</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2011</Year>
<Month>12</Month>
<Day>21</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2012</Year>
<Month>1</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2012</Year>
<Month>1</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2012</Year>
<Month>5</Month>
<Day>1</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">22226985</ArticleId>
<ArticleId IdType="pii">S0022-5193(11)00648-5</ArticleId>
<ArticleId IdType="doi">10.1016/j.jtbi.2011.12.024</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
<affiliations>
<list>
<country>
<li>Espagne</li>
</country>
<region>
<li>Andalousie</li>
</region>
<settlement>
<li>Grenade (Espagne)</li>
</settlement>
<orgName>
<li>Université de Grenade</li>
</orgName>
</list>
<tree>
<noCountry>
<name sortKey="Barturen, Guillermo" sort="Barturen, Guillermo" uniqKey="Barturen G" first="Guillermo" last="Barturen">Guillermo Barturen</name>
<name sortKey="Bernaola Galvan, Pedro" sort="Bernaola Galvan, Pedro" uniqKey="Bernaola Galvan P" first="Pedro" last="Bernaola-Galván">Pedro Bernaola-Galván</name>
<name sortKey="Carpena, Pedro" sort="Carpena, Pedro" uniqKey="Carpena P" first="Pedro" last="Carpena">Pedro Carpena</name>
<name sortKey="Oliver, Jose L" sort="Oliver, Jose L" uniqKey="Oliver J" first="José L" last="Oliver">José L. Oliver</name>
<name sortKey="Rueda, Antonio" sort="Rueda, Antonio" uniqKey="Rueda A" first="Antonio" last="Rueda">Antonio Rueda</name>
</noCountry>
<country name="Espagne">
<region name="Andalousie">
<name sortKey="Hackenberg, Michael" sort="Hackenberg, Michael" uniqKey="Hackenberg M" first="Michael" last="Hackenberg">Michael Hackenberg</name>
</region>
</country>
</tree>
</affiliations>
</record>
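
As a rough illustration of how the `<pubmed>` part of such a record could be read programmatically, the sketch below uses Python's standard `xml.etree.ElementTree` on a shortened excerpt of the record above. The excerpt and the helper name `summarize` are assumptions made for the example. The `<pubmed>` subtree is used because the TEI part relies on `wicri:` and `nlm:` prefixes whose namespace declarations do not appear in this dump, so a namespace-aware parser would likely reject the full record as shown.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the <pubmed> subtree (only two authors kept).
EXCERPT = """<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">22226985</PMID>
<Article PubModel="Print-Electronic">
<ArticleTitle>Clustering of DNA words and biological function: a proof of principle.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>Hackenberg</LastName><ForeName>Michael</ForeName><Initials>M</Initials></Author>
<Author ValidYN="Y"><LastName>Rueda</LastName><ForeName>Antonio</ForeName><Initials>A</Initials></Author>
</AuthorList>
</Article>
</MedlineCitation>
</pubmed>"""


def summarize(pubmed_xml):
    """Return the PMID, title, and author names found in a <pubmed> element."""
    root = ET.fromstring(pubmed_xml)
    citation = root.find("MedlineCitation")
    authors = [
        f"{a.findtext('ForeName')} {a.findtext('LastName')}"
        for a in citation.iter("Author")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "authors": authors,
    }


summary = summarize(EXCERPT)
print(summary["pmid"], summary["authors"])
```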

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Ncbi/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000917 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd -nk 000917 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Ncbi
   |étape=   Merge
   |type=    RBID
   |clé=     pubmed:22226985
   |texte=   Clustering of DNA words and biological function: a proof of principle.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/RBID.i   -Sk "pubmed:22226985" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021