MERS Exploration Server


Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes

Internal identifier: 000F51 (Pmc/Corpus); previous: 000F50; next: 000F52

Authors: Heejoon Chae; Jinwoo Park; Seong-Whan Lee; Kenneth P. Nephew; Sun Kim

RBID : PMC:3643570

Abstract

CpG islands are GC-rich regions often located at the 5′ end of genes and normally protected from cytosine methylation in mammals. The important role of CpG islands in gene transcription strongly suggests evolutionary conservation in the mammalian genome. However, as CpG dinucleotides are over-represented in CpG islands, comparative CpG island analysis using conventional sequence analysis techniques remains a major challenge in the epigenetics field. In this study, we conducted a comparative analysis of all CpG island sequences in 10 mammalian genomes. As sequence similarity methods and character composition techniques such as information theory are particularly difficult to apply, we used exact patterns in CpG island sequences and single character discrepancies to identify differences in CpG island sequences. First, by calculating genome distance based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a human model on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.


DOI: 10.1093/nar/gkt144
PubMed: 23519616
PubMed Central: 3643570

Links to Exploration step

PMC:3643570

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes</title>
<author>
<name sortKey="Chae, Heejoon" sort="Chae, Heejoon" uniqKey="Chae H" first="Heejoon" last="Chae">Heejoon Chae</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Park, Jinwoo" sort="Park, Jinwoo" uniqKey="Park J" first="Jinwoo" last="Park">Jinwoo Park</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science and Engineering, Bioinformatics Institute, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkt144-AFF1">Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lee, Seong Whan" sort="Lee, Seong Whan" uniqKey="Lee S" first="Seong-Whan" last="Lee">Seong-Whan Lee</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gkt144-AFF1">Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nephew, Kenneth P" sort="Nephew, Kenneth P" uniqKey="Nephew K" first="Kenneth P." last="Nephew">Kenneth P. Nephew</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Medical Sciences Program, Indiana University School of Medicine, Indiana University, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kim, Sun" sort="Kim, Sun" uniqKey="Kim S" first="Sun" last="Kim">Sun Kim</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science and Engineering, Bioinformatics Institute, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkt144-AFF1">Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23519616</idno>
<idno type="pmc">3643570</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3643570</idno>
<idno type="RBID">PMC:3643570</idno>
<idno type="doi">10.1093/nar/gkt144</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000F51</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F51</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes</title>
<author>
<name sortKey="Chae, Heejoon" sort="Chae, Heejoon" uniqKey="Chae H" first="Heejoon" last="Chae">Heejoon Chae</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Park, Jinwoo" sort="Park, Jinwoo" uniqKey="Park J" first="Jinwoo" last="Park">Jinwoo Park</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science and Engineering, Bioinformatics Institute, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkt144-AFF1">Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lee, Seong Whan" sort="Lee, Seong Whan" uniqKey="Lee S" first="Seong-Whan" last="Lee">Seong-Whan Lee</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="gkt144-AFF1">Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nephew, Kenneth P" sort="Nephew, Kenneth P" uniqKey="Nephew K" first="Kenneth P." last="Nephew">Kenneth P. Nephew</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Medical Sciences Program, Indiana University School of Medicine, Indiana University, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kim, Sun" sort="Kim, Sun" uniqKey="Kim S" first="Sun" last="Kim">Sun Kim</name>
<affiliation>
<nlm:aff id="gkt144-AFF1">Department of Computer Science and Engineering, Bioinformatics Institute, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="gkt144-AFF1">Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>CpG islands are GC-rich regions often located at the 5′ end of genes and normally protected from cytosine methylation in mammals. The important role of CpG islands in gene transcription strongly suggests evolutionary conservation in the mammalian genome. However, as CpG dinucleotides are over-represented in CpG islands, comparative CpG island analysis using conventional sequence analysis techniques remains a major challenge in the epigenetics field. In this study, we conducted a comparative analysis of all CpG island sequences in 10 mammalian genomes. As sequence similarity methods and character composition techniques such as information theory are particularly difficult to apply, we used exact patterns in CpG island sequences and single character discrepancies to identify differences in CpG island sequences. First, by calculating genome distance based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a human model on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Jabbari, K" uniqKey="Jabbari K">K Jabbari</name>
</author>
<author>
<name sortKey="Bernardi, G" uniqKey="Bernardi G">G Bernardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, Pa" uniqKey="Jones P">PA Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bell, Cg" uniqKey="Bell C">CG Bell</name>
</author>
<author>
<name sortKey="Wilson, Ga" uniqKey="Wilson G">GA Wilson</name>
</author>
<author>
<name sortKey="Butcher, Lm" uniqKey="Butcher L">LM Butcher</name>
</author>
<author>
<name sortKey="Roos, C" uniqKey="Roos C">C Roos</name>
</author>
<author>
<name sortKey="Walter, L" uniqKey="Walter L">L Walter</name>
</author>
<author>
<name sortKey="Beck, S" uniqKey="Beck S">S Beck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Portela, A" uniqKey="Portela A">A Portela</name>
</author>
<author>
<name sortKey="Esteller, M" uniqKey="Esteller M">M Esteller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feinberg, Ap" uniqKey="Feinberg A">AP Feinberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burge, C" uniqKey="Burge C">C Burge</name>
</author>
<author>
<name sortKey="Campbell, Am" uniqKey="Campbell A">AM Campbell</name>
</author>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Scarano, E" uniqKey="Scarano E">E Scarano</name>
</author>
<author>
<name sortKey="Iaccarino, M" uniqKey="Iaccarino M">M Iaccarino</name>
</author>
<author>
<name sortKey="Grippo, P" uniqKey="Grippo P">P Grippo</name>
</author>
<author>
<name sortKey="Parisi, E" uniqKey="Parisi E">E Parisi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
<author>
<name sortKey="Paulsen, M" uniqKey="Paulsen M">M Paulsen</name>
</author>
<author>
<name sortKey="Tierling, S" uniqKey="Tierling S">S Tierling</name>
</author>
<author>
<name sortKey="Mikeska, T" uniqKey="Mikeska T">T Mikeska</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
<author>
<name sortKey="Walter, J" uniqKey="Walter J">J Walter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fatemi, M" uniqKey="Fatemi M">M Fatemi</name>
</author>
<author>
<name sortKey="Pao, Mm" uniqKey="Pao M">MM Pao</name>
</author>
<author>
<name sortKey="Jeong, S" uniqKey="Jeong S">S Jeong</name>
</author>
<author>
<name sortKey="Gal Yam, En" uniqKey="Gal Yam E">EN Gal-Yam</name>
</author>
<author>
<name sortKey="Egger, G" uniqKey="Egger G">G Egger</name>
</author>
<author>
<name sortKey="Weisenberger, Dj" uniqKey="Weisenberger D">DJ Weisenberger</name>
</author>
<author>
<name sortKey="Jones, Pa" uniqKey="Jones P">PA Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Larsen, F" uniqKey="Larsen F">F Larsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saxonov, S" uniqKey="Saxonov S">S Saxonov</name>
</author>
<author>
<name sortKey="Berg, P" uniqKey="Berg P">P Berg</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Antequera, F" uniqKey="Antequera F">F Antequera</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sharif, J" uniqKey="Sharif J">J Sharif</name>
</author>
<author>
<name sortKey="Endo, Ta" uniqKey="Endo T">TA Endo</name>
</author>
<author>
<name sortKey="Toyoda, T" uniqKey="Toyoda T">T Toyoda</name>
</author>
<author>
<name sortKey="Koseki, H" uniqKey="Koseki H">H Koseki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yan, Q" uniqKey="Yan Q">Q Yan</name>
</author>
<author>
<name sortKey="Masson, R" uniqKey="Masson R">R Masson</name>
</author>
<author>
<name sortKey="Ren, Y" uniqKey="Ren Y">Y Ren</name>
</author>
<author>
<name sortKey="Rosati, B" uniqKey="Rosati B">B Rosati</name>
</author>
<author>
<name sortKey="Mckinnon, D" uniqKey="Mckinnon D">D McKinnon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gardiner Garden, M" uniqKey="Gardiner Garden M">M Gardiner-Garden</name>
</author>
<author>
<name sortKey="Frommer, M" uniqKey="Frommer M">M Frommer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Takai, D" uniqKey="Takai D">D Takai</name>
</author>
<author>
<name sortKey="Jones, Pa" uniqKey="Jones P">PA Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, H" uniqKey="Wu H">H Wu</name>
</author>
<author>
<name sortKey="Caffo, B" uniqKey="Caffo B">B Caffo</name>
</author>
<author>
<name sortKey="Jaffee, Ha" uniqKey="Jaffee H">HA Jaffee</name>
</author>
<author>
<name sortKey="Irizarry, Ra" uniqKey="Irizarry R">RA Irizarry</name>
</author>
<author>
<name sortKey="Feinberg, Ap" uniqKey="Feinberg A">AP Feinberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feuerbach, L" uniqKey="Feuerbach L">L Feuerbach</name>
</author>
<author>
<name sortKey="Halachev, K" uniqKey="Halachev K">K Halachev</name>
</author>
<author>
<name sortKey="Assenov, Y" uniqKey="Assenov Y">Y Assenov</name>
</author>
<author>
<name sortKey="Mller, F" uniqKey="Mller F">F Müller</name>
</author>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cohen, Nm" uniqKey="Cohen N">NM Cohen</name>
</author>
<author>
<name sortKey="Kenigsberg, E" uniqKey="Kenigsberg E">E Kenigsberg</name>
</author>
<author>
<name sortKey="Tanay, A" uniqKey="Tanay A">A Tanay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nussinov, R" uniqKey="Nussinov R">R Nussinov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saitou, N" uniqKey="Saitou N">N Saitou</name>
</author>
<author>
<name sortKey="Nei, M" uniqKey="Nei M">M Nei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kendall, Mg" uniqKey="Kendall M">MG Kendall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kendall, Mg" uniqKey="Kendall M">MG Kendall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fredsiund, J" uniqKey="Fredsiund J">J Fredslund</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Witten, Lh" uniqKey="Witten L">IH Witten</name>
</author>
<author>
<name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miele, V" uniqKey="Miele V">V Miele</name>
</author>
<author>
<name sortKey="Bourguignon, Py" uniqKey="Bourguignon P">PY Bourguignon</name>
</author>
<author>
<name sortKey="Robelin, D" uniqKey="Robelin D">D Robelin</name>
</author>
<author>
<name sortKey="Nuel, G" uniqKey="Nuel G">G Nuel</name>
</author>
<author>
<name sortKey="Richard, H" uniqKey="Richard H">H Richard</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="iso-abbrev">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23519616</article-id>
<article-id pub-id-type="pmc">3643570</article-id>
<article-id pub-id-type="doi">10.1093/nar/gkt144</article-id>
<article-id pub-id-type="publisher-id">gkt144</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Gene Regulation, Chromatin and Epigenetics</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Chae</surname>
<given-names>Heejoon</given-names>
</name>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Park</surname>
<given-names>Jinwoo</given-names>
</name>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Lee</surname>
<given-names>Seong-Whan</given-names>
</name>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Nephew</surname>
<given-names>Kenneth P.</given-names>
</name>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kim</surname>
<given-names>Sun</given-names>
</name>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="gkt144-AFF1">
<sup>3</sup>
</xref>
<xref ref-type="corresp" rid="gkt144-COR1">*</xref>
</contrib>
</contrib-group>
<aff id="gkt144-AFF1">
<sup>1</sup>
Department of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA,
<sup>2</sup>
Department of Computer Science and Engineering, Bioinformatics Institute, Seoul National University, Seoul, Korea,
<sup>3</sup>
Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea,
<sup>4</sup>
Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea and
<sup>5</sup>
Medical Sciences Program, Indiana University School of Medicine, Indiana University, Bloomington, IN, USA</aff>
<author-notes>
<corresp id="gkt144-COR1">*To whom correspondence should be addressed. Tel:
<phone>+82 2 880 7280</phone>
; Fax:
<fax>+82 2 886 7589</fax>
; Email:
<email>sunkim.bioinfo@snu.ac.kr</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>5</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>20</day>
<month>3</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>20</day>
<month>3</month>
<year>2013</year>
</pub-date>
<volume>41</volume>
<issue>9</issue>
<fpage>4783</fpage>
<lpage>4791</lpage>
<history>
<date date-type="received">
<day>27</day>
<month>10</month>
<year>2012</year>
</date>
<date date-type="rev-recd">
<day>24</day>
<month>1</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>2</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2013. Published by Oxford University Press.</copyright-statement>
<copyright-year>2013</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/">http://creativecommons.org/licenses/by-nc/3.0/</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>CpG islands are GC-rich regions often located at the 5′ end of genes and normally protected from cytosine methylation in mammals. The important role of CpG islands in gene transcription strongly suggests evolutionary conservation in the mammalian genome. However, as CpG dinucleotides are over-represented in CpG islands, comparative CpG island analysis using conventional sequence analysis techniques remains a major challenge in the epigenetics field. In this study, we conducted a comparative analysis of all CpG island sequences in 10 mammalian genomes. As sequence similarity methods and character composition techniques such as information theory are particularly difficult to apply, we used exact patterns in CpG island sequences and single character discrepancies to identify differences in CpG island sequences. First, by calculating genome distance based on rank correlation tests, we show that k-mer and k-flank patterns around CpG sites can be used to correctly reconstruct the phylogeny of 10 mammalian genomes. Further, we used various machine learning algorithms to demonstrate that CpG island sequences can be characterized using k-mers. In addition, by testing a human model on the nine other mammalian genomes, we provide the first evidence that k-mer signatures are consistent with evolutionary history.</p>
</abstract>
<counts>
<page-count count="9"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>INTRODUCTION</title>
<p>The dinucleotide sequence CpG (a cytosine followed by a guanine, linked by a phosphodiester bond) is a target for DNA methylation. The cytosine residue in CpG sites is frequently modified to form 5-methylcytosine, and 70–80% of CpG dinucleotides in mammalian genomes are methylated (
<xref ref-type="bibr" rid="gkt144-B1">1</xref>
). DNA methylation is essential for proper mammalian development and plays crucial roles in imprinting, maintaining genomic stability and many other biological processes (
<xref ref-type="bibr" rid="gkt144-B2">2</xref>
). In addition, aberrant DNA methylation changes have been detected in several diseases (
<xref ref-type="bibr" rid="gkt144-B3 gkt144-B4 gkt144-B5">3–5</xref>
). In the human genome, spontaneous deamination of methylated cytosine in the context of CpG dinucleotides results in the creation of thymine (
<inline-formula>
<inline-graphic xlink:href="gkt144i1.jpg"></inline-graphic>
</inline-formula>
) and under-representation of CG dinucleotides over evolutionary time (known as CG suppression) (
<xref ref-type="bibr" rid="gkt144-B6">6</xref>
,
<xref ref-type="bibr" rid="gkt144-B7">7</xref>
). In fact, the frequency of CpG sites in vertebrate genomes is only about a fifth of the expected frequency, given the GC content of the genome. Although CpG sites are under-represented in genomes overall, clusters of CpGs known as CpG islands are observed, and these are normally protected from methylation (
<xref ref-type="bibr" rid="gkt144-B8">8</xref>
). The vast majority of genes are associated with CpG islands, and ∼40% of gene promoters contain a CpG island (
<xref ref-type="bibr" rid="gkt144-B9">9</xref>
), including the 5′ ends of housekeeping genes and many tissue-specific genes in vertebrates (
<xref ref-type="bibr" rid="gkt144-B10">10</xref>
). Recently, human promoters were classified into two groups: high-CpG promoters (about 70%) and low-CpG promoters (about 30%, with CpG content characteristic of the overall genome) (
<xref ref-type="bibr" rid="gkt144-B11">11</xref>
). Comparative studies (
<xref ref-type="bibr" rid="gkt144-B12">12</xref>
) on CpG island promoter organization, in terms of protein–DNA interactions and patterns of expression, recently reported a strong link between CpG islands and evolution and found that the accumulation of CpG islands at transcription start sites (TSS) is a vertebrate-specific genomic feature. Those authors (
<xref ref-type="bibr" rid="gkt144-B12">12</xref>
) suggested that CpG islands at TSS are a consequence of warm-blooded vertebrate evolution, presumably for efficient regulation of transcription in large genomes. On the other hand, CpG islands could have played a direct role in the evolution of warm-blooded vertebrates, perhaps contributing to the gain of the placenta, a hallmark of eutherian mammals (
<xref ref-type="bibr" rid="gkt144-B13">13</xref>
). In support of the latter, a relationship between evolution of CpG island promoter function and gene expression in mammalian heart was recently reported (
<xref ref-type="bibr" rid="gkt144-B14">14</xref>
).</p>
<p>To date, objective definitions of CpG islands are limited. Gardiner-Garden and Frommer (
<xref ref-type="bibr" rid="gkt144-B15">15</xref>
) described CpG islands as regions of at least 200 bp with a GC content >50% and an observed/expected CpG ratio >60%. Takai and Jones (
<xref ref-type="bibr" rid="gkt144-B16">16</xref>
) revised the definition of CpG islands, reporting that DNA regions of at least 500 bp with a GC content >55% and an observed/expected CpG ratio >65% were more likely to be true CpG islands associated with the 5′ end regions of genes. They also enhanced the ability to detect CpG islands by excluding other GC-rich genomic sequences such as Alu repeats (
<xref ref-type="bibr" rid="gkt144-B17">17</xref>
,
<xref ref-type="bibr" rid="gkt144-B18">18</xref>
). Despite significant efforts to define CpG islands, it remains a challenge to perform computational CpG island analysis using conventional sequence analysis methods.</p>
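<p>As an illustration of the two threshold-based definitions above, the following minimal Python sketch tests whether a sequence satisfies the length, GC content and observed/expected CpG criteria. The function names are hypothetical, and the observed/expected ratio is assumed to be the commonly used form (number of CpG × sequence length) / (number of C × number of G); it is not code from this study.</p>

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio, assumed here to be
    (number of CG dinucleotides * length) / (number of C * number of G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_criteria(seq, min_len=200, min_gc=0.50, min_oe=0.60):
    """Defaults follow the Gardiner-Garden and Frommer thresholds quoted above."""
    return (
        len(seq) >= min_len
        and gc_content(seq) > min_gc
        and cpg_obs_exp(seq) > min_oe
    )
```

<p>With the defaults the check uses the first definition; passing min_len=500, min_gc=0.55 and min_oe=0.65 switches to the revised thresholds.</p>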
<p>To overcome this barrier, we propose novel oligomer-counting approaches for the comparative analysis of all CpG island sequences in 10 mammalian genomes. These two new approaches use exact sequence patterns in CpG island sequences, called k-mers and k-flanks. After counting the k-mers and k-flanks, the pattern counts were used to reconstruct the phylogeny of the 10 mammalian genomes and for machine learning analysis. We demonstrate that k-mers are characteristic of CpG island sequences and also show that k-mer data are consistent with the evolutionary history of the 10 mammalian genomes. To our knowledge, this extensive study represents the first comparative analysis of CpG island sequences in mammalian genomes.</p>
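<p>The k-mer counting step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, k-mers are counted over all overlapping windows, and windows containing ambiguous bases are skipped. The k-flank pattern is not defined in this part of the text, so it is not sketched here.</p>

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip windows containing N or other ambiguity codes.
            if set(kmer).issubset("ACGT"):
                counts[kmer] += 1
    return counts


def rank_kmers(counts):
    """Order k-mers from most to least frequent (ties broken alphabetically)."""
    return sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
```

<p>A ranked list of this kind, computed per genome, is one possible input for a rank-based comparison between genomes.</p>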
</sec>
<sec sec-type="materials|methods">
<title>MATERIALS AND METHODS</title>
<sec>
<title>Raw data</title>
<p>CpG island sequences of 10 mammalian genomes were downloaded from the UCSC Genome Browser and used for the analysis.
<xref ref-type="table" rid="gkt144-T1">Table 1</xref>
lists the 10 mammalian reference genomes and the assembly versions from which the CpG island sequences were taken.
<table-wrap id="gkt144-T1" position="float">
<label>Table 1.</label>
<caption>
<p>CpG island sequence data for the 10 mammalian genomes</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Species</th>
<th rowspan="1" colspan="1">Data version</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Chimp</td>
<td rowspan="1" colspan="1">CGSC 2.1.3/panTro3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Cow</td>
<td rowspan="1" colspan="1">Bos taurus UMD 3.1/bosTau6</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Dog</td>
<td rowspan="1" colspan="1">Broad/canFam2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Human</td>
<td rowspan="1" colspan="1">GRCh37/hg19</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Marmoset</td>
<td rowspan="1" colspan="1">WUGSC 3.2/calJac3</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mouse</td>
<td rowspan="1" colspan="1">NCBI37/mm9</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Opossum</td>
<td rowspan="1" colspan="1">Broad/monDom5</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Pig</td>
<td rowspan="1" colspan="1">SGSC Sscrofa9.2/susScr2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rat</td>
<td rowspan="1" colspan="1">Baylor 3.4/rn4</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rhesus</td>
<td rowspan="1" colspan="1">MGSC Merged 1.0/rheMac2</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec>
<title>BLAST and information theoretic approaches are not effective for CpG island sequence analysis</title>
<p>Despite previous studies on CpG island sequences, it is unclear why CpG island regions are found more frequently in mammalian genomes compared with other genomes. In addition, while studies exist on CpG island sequences and evolution (
<xref ref-type="bibr" rid="gkt144-B19">19</xref>
,
<xref ref-type="bibr" rid="gkt144-B20">20</xref>
) in primates, comparative mammalian studies are lacking, perhaps owing to the difficulty of performing computational analysis of CpG island sequences containing over-represented CpG dinucleotides. In this regard, conventional sequence analysis techniques are not effective for the comparative analysis of highly similar CpG island sequences (as shown below).</p>
<p>We first applied BLAST to the CpG island sequences. Owing to the over-represented CpG dinucleotides, CpG island sequences are very similar to each other, so sequence similarity-based methods such as BLAST are not effective for comparative CpG island sequence analysis. We then took another traditional approach and computed the relative entropy between the 10 mammalian species. As shown in
<xref ref-type="table" rid="gkt144-T2">Table 2</xref>
, the relative entropy did not reveal significant differences among CpG island sequences. The procedure used to compute the relative entropy between species is as follows.
<list list-type="roman-lower">
<list-item>
<p>Let P and Q be the nucleotide probability distributions of the CpG island sequences of chimp and human, respectively</p>
</list-item>
<list-item>
<p>Compute the probabilities of A, G, T and C in the CpG island sequences of chimp and human. Let them be
<inline-formula>
<inline-graphic xlink:href="gkt144i2.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="gkt144i3.jpg"></inline-graphic>
</inline-formula>
</p>
</list-item>
<list-item>
<p>Relative entropy is calculated by
<inline-formula>
<inline-graphic xlink:href="gkt144i4.jpg"></inline-graphic>
</inline-formula>
</p>
</list-item>
<list-item>
<p>Repeat (i) to (iii) for all species pairwise</p>
</list-item>
</list>
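The procedure listed above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, base counts are assumed to be pooled over all CpG island sequences of a species, the logarithm is taken to base 2, and every base is assumed to occur at least once in each species.

```python
import math
from collections import Counter


def base_distribution(sequences):
    """Probabilities of A, C, G and T pooled over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def relative_entropy(p, q):
    """D(P || Q) = sum over bases b of P(b) * log2(P(b) / Q(b)).
    Assumes Q(b) is non-zero wherever P(b) is non-zero."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in "ACGT" if p[b] > 0)


def pairwise_relative_entropy(species_sequences):
    """Relative entropy for every ordered pair of species."""
    dists = {name: base_distribution(seqs)
             for name, seqs in species_sequences.items()}
    return {(a, b): relative_entropy(dists[a], dists[b])
            for a in dists for b in dists}
```

Relative entropy is not symmetric in general, which is why the table of pairwise values has slightly different entries above and below the diagonal.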
<table-wrap id="gkt144-T2" position="float">
<label>Table 2.</label>
<caption>
<p>Relative entropy between species</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Species</th>
<th rowspan="1" colspan="1">Chimp</th>
<th rowspan="1" colspan="1">Cow</th>
<th rowspan="1" colspan="1">Dog</th>
<th rowspan="1" colspan="1">Human</th>
<th rowspan="1" colspan="1">Marmoset</th>
<th rowspan="1" colspan="1">Mouse</th>
<th rowspan="1" colspan="1">Opossum</th>
<th rowspan="1" colspan="1">Pig</th>
<th rowspan="1" colspan="1">Rat</th>
<th rowspan="1" colspan="1">Rhesus</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Chimp</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">6.52E-05</td>
<td rowspan="1" colspan="1">1.24E-05</td>
<td rowspan="1" colspan="1">2.61E-05</td>
<td rowspan="1" colspan="1">3.38E-04</td>
<td rowspan="1" colspan="1">2.61E-05</td>
<td rowspan="1" colspan="1">8.42E-04</td>
<td rowspan="1" colspan="1">7.18E-04</td>
<td rowspan="1" colspan="1">4.15E-04</td>
<td rowspan="1" colspan="1">2.20E-06</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Cow</td>
<td rowspan="1" colspan="1">6.51E-05</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">1.34E-04</td>
<td rowspan="1" colspan="1">1.71E-04</td>
<td rowspan="1" colspan="1">1.07E-04</td>
<td rowspan="1" colspan="1">1.31E-05</td>
<td rowspan="1" colspan="1">4.39E-04</td>
<td rowspan="1" colspan="1">3.54E-04</td>
<td rowspan="1" colspan="1">1.56E-04</td>
<td rowspan="1" colspan="1">4.59E-05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Dog</td>
<td rowspan="1" colspan="1">1.24E-05</td>
<td rowspan="1" colspan="1">1.34E-04</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">4.47E-06</td>
<td rowspan="1" colspan="1">4.80E-04</td>
<td rowspan="1" colspan="1">7.28E-05</td>
<td rowspan="1" colspan="1">1.06E-03</td>
<td rowspan="1" colspan="1">9.19E-04</td>
<td rowspan="1" colspan="1">5.69E-04</td>
<td rowspan="1" colspan="1">2.41E-05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Human</td>
<td rowspan="1" colspan="1">2.62E-05</td>
<td rowspan="1" colspan="1">1.72E-04</td>
<td rowspan="1" colspan="1">4.48E-06</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">5.48E-04</td>
<td rowspan="1" colspan="1">1.00E-04</td>
<td rowspan="1" colspan="1">1.15E-03</td>
<td rowspan="1" colspan="1">1.00E-03</td>
<td rowspan="1" colspan="1">6.43E-04</td>
<td rowspan="1" colspan="1">4.12E-05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Marmoset</td>
<td rowspan="1" colspan="1">3.37E-04</td>
<td rowspan="1" colspan="1">1.07E-04</td>
<td rowspan="1" colspan="1">4.77E-04</td>
<td rowspan="1" colspan="1">5.44E-04</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">1.80E-04</td>
<td rowspan="1" colspan="1">1.12E-04</td>
<td rowspan="1" colspan="1">7.23E-05</td>
<td rowspan="1" colspan="1">6.83E-06</td>
<td rowspan="1" colspan="1">2.88E-04</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mouse</td>
<td rowspan="1" colspan="1">2.61E-05</td>
<td rowspan="1" colspan="1">1.31E-05</td>
<td rowspan="1" colspan="1">7.26E-05</td>
<td rowspan="1" colspan="1">1.00E-04</td>
<td rowspan="1" colspan="1">1.81E-04</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">5.79E-04</td>
<td rowspan="1" colspan="1">4.75E-04</td>
<td rowspan="1" colspan="1">2.35E-04</td>
<td rowspan="1" colspan="1">1.33E-05</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Opossum</td>
<td rowspan="1" colspan="1">8.35E-04</td>
<td rowspan="1" colspan="1">4.36E-04</td>
<td rowspan="1" colspan="1">1.04E-03</td>
<td rowspan="1" colspan="1">1.14E-03</td>
<td rowspan="1" colspan="1">1.12E-04</td>
<td rowspan="1" colspan="1">5.75E-04</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">8.32E-06</td>
<td rowspan="1" colspan="1">8.11E-05</td>
<td rowspan="1" colspan="1">7.58E-04</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Pig</td>
<td rowspan="1" colspan="1">7.12E-04</td>
<td rowspan="1" colspan="1">3.52E-04</td>
<td rowspan="1" colspan="1">9.11E-04</td>
<td rowspan="1" colspan="1">9.97E-04</td>
<td rowspan="1" colspan="1">7.21E-05</td>
<td rowspan="1" colspan="1">4.72E-04</td>
<td rowspan="1" colspan="1">8.34E-06</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">4.62E-05</td>
<td rowspan="1" colspan="1">6.40E-04</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rat</td>
<td rowspan="1" colspan="1">4.13E-04</td>
<td rowspan="1" colspan="1">1.56E-04</td>
<td rowspan="1" colspan="1">5.64E-04</td>
<td rowspan="1" colspan="1">6.38E-04</td>
<td rowspan="1" colspan="1">6.83E-06</td>
<td rowspan="1" colspan="1">2.34E-04</td>
<td rowspan="1" colspan="1">8.12E-05</td>
<td rowspan="1" colspan="1">4.63E-05</td>
<td rowspan="1" colspan="1">0.00E+00</td>
<td rowspan="1" colspan="1">3.57E-04</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rhesus</td>
<td rowspan="1" colspan="1">2.20E-06</td>
<td rowspan="1" colspan="1">4.60E-05</td>
<td rowspan="1" colspan="1">2.41E-05</td>
<td rowspan="1" colspan="1">4.11E-05</td>
<td rowspan="1" colspan="1">2.90E-04</td>
<td rowspan="1" colspan="1">1.33E-05</td>
<td rowspan="1" colspan="1">7.64E-04</td>
<td rowspan="1" colspan="1">6.45E-04</td>
<td rowspan="1" colspan="1">3.59E-04</td>
<td rowspan="1" colspan="1">0.00E+00</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="gkt144-TF1">
<p>The differences are not large enough to represent CpG island sequence features.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
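<p>The relative entropy procedure listed above can be sketched in a few lines. This is an illustration only: it assumes that the relative entropy is the Kullback-Leibler divergence with a base-2 logarithm and that base probabilities are pooled over all CpG island sequences of a species. The function names and the toy sequences are hypothetical and are not taken from the study.</p>
<preformat>
```python
import math
from collections import Counter


def base_probabilities(sequences):
    """Return the relative frequency of A, C, G and T pooled over the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits; assumes q[base] is nonzero."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in "ACGT" if p[b] > 0)


# Toy sequences for illustration; these are not real CpG islands.
chimp = ["CGCGGGCCATCG", "GCGCGCTTACGG"]
human = ["CGCGGGCCATCG", "GCGCGCATACGC"]
p = base_probabilities(chimp)
q = base_probabilities(human)
d_pq = relative_entropy(p, q)  # chimp against human
d_qp = relative_entropy(q, p)  # human against chimp; generally not equal to d_pq
```
</preformat>
<p>Because this divergence is not symmetric, computing it in both directions for every pair of species yields a matrix whose two triangles need not match exactly, as is the case in Table 2.</p>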
</sec>
<sec>
<title>The k-mer and k-flank approaches</title>
<p>As shown in the previous section, over-represented CpG dinucleotides make it difficult to perform analysis of CpG islands. To overcome this hurdle, we used two new oligomer-counting approaches, exact sequence patterns called k-mer and k-flank, for the comparative analysis of all CpG island sequences in 10 mammalian genomes. The first model, k-mer, gives a general descriptor of the oligomer landscape in the entire CpG island sequence. Given a sequence S, k-mers are sub-sequences of S of length k, also known as oligomers for small k. For each CpG island sequence, sliding windows of length k are moved across the CpG island sequence from the 5′ end to 3′ end, and each k-mer occurrence is counted. Determining the number of occurrences of specific k-mers in a sequence is called k-mer counting or oligomer counting, and can provide descriptive information about the DNA sequence (
<xref ref-type="bibr" rid="gkt144-B6">6</xref>
,
<xref ref-type="bibr" rid="gkt144-B21">21</xref>
). To better characterize and describe CpG island sequences, we used k-mer counting techniques and frequency measurements to perform a comparative analysis of the CpG islands. The second oligomer-counting approach, called k-flank, records the DNA sequence of length k directly upstream and downstream of each CpG site in a CpG island sequence. This oligomer model is stricter and specifically describes the DNA bases directly adjacent to CpG sites.
<xref ref-type="fig" rid="gkt144-F1">Figure 1</xref>
illustrates the definition of k-mer and k-flank. In this study, we counted k-mers and k-flanks for k from 3 to 10. Once the k-mers and k-flanks were counted, we used the pattern counts to reconstruct the phylogeny of the 10 mammalian genomes and for machine learning analysis to show that (i) k-mers are characteristic of CpG island sequences and (ii) k-mer data are consistent with the evolutionary history of the 10 mammalian genomes. As far as we know, this study is the first extensive comparative analysis of CpG island sequences.
<fig id="gkt144-F1" position="float">
<label>Figure 1.</label>
<caption>
<p>Definition of K-mer and K-flank.</p>
</caption>
<graphic xlink:href="gkt144f1p"></graphic>
</fig>
</p>
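<p>A minimal sketch of the two counting schemes follows. The k-mer counter is the sliding window described above. For the k-flank counter, the exact convention of Figure 1 is not reproduced here, so the sketch assumes that a k-flank is the k bases upstream plus the CG dinucleotide plus the k bases downstream, and that CpG sites too close to either end are skipped. Function names and the toy sequence are hypothetical.</p>
<preformat>
```python
from collections import Counter


def count_kmers(seq, k):
    """Count every length-k window while sliding from the 5' end to the 3' end."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kflanks(seq, k):
    """Count k bases upstream and downstream of each CG, kept together with the CG.

    CpG sites with fewer than k bases on either side are skipped (an assumption).
    """
    seq = seq.upper()
    flanks = Counter()
    start = seq.find("CG")
    while start != -1:
        if start >= k and len(seq) >= start + 2 + k:
            flanks[seq[start - k:start + 2 + k]] += 1
        start = seq.find("CG", start + 1)
    return flanks


island = "ATCGGACGTTCGA"  # toy sequence, not a real CpG island
kmers = count_kmers(island, 3)
kflanks = count_kflanks(island, 2)
```
</preformat>
<p>Both functions return frequency tables keyed by pattern, which is the form needed for the ranking and machine learning steps that follow.</p>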
</sec>
<sec>
<title>Collecting k-mer/k-flank frequencies and common k-mers/k-flanks</title>
<p>
<list list-type="roman-lower">
<list-item>
<p>Count k-mer and k-flank frequencies in the CpG islands of each species</p>
</list-item>
<list-item>
<p>Collect the k-mers and k-flanks common to all 10 mammalian genomes, together with their frequencies</p>
</list-item>
</list>
</p>
<p>Once the common k-mers/k-flanks are collected, they are sorted by frequency and assigned ranks. Because these k-mers/k-flanks are common to all 10 mammalian genomes, the order of k-mers/k-flanks in each species is a permutation of the ranks of the common k-mers/k-flanks. This ranking method rests on the assumption that all these mammalian species are closely related from an evolutionary perspective.
<xref ref-type="fig" rid="gkt144-F2">Figure 2</xref>
illustrates the experimental protocol.
<fig id="gkt144-F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Collecting the k-mers (k-flanks) that exist in all species. Each letter stands for a k-mer (k-flank) sequence. The k-mer (k-flank) order in each species is set relative to the common rank order, and the rank difference between the common order and each species is computed.</p>
</caption>
<graphic xlink:href="gkt144f2p"></graphic>
</fig>
</p>
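<p>The collection and ranking steps can be sketched as follows, assuming that each species is represented by a table mapping patterns to frequencies. The tie-breaking rule, the function names and the toy frequencies are choices made for this illustration and are not part of the original protocol.</p>
<preformat>
```python
def common_patterns(freqs_by_species):
    """Return the patterns present in the frequency table of every species."""
    tables = list(freqs_by_species.values())
    common = set(tables[0])
    for table in tables[1:]:
        common.intersection_update(table)
    return common


def rank_patterns(freqs, patterns):
    """Rank the given patterns by descending frequency (rank 1 = most frequent).

    Ties are broken alphabetically, an arbitrary choice made for this sketch.
    """
    ordered = sorted(patterns, key=lambda p: (-freqs[p], p))
    return {p: rank for rank, p in enumerate(ordered, start=1)}


# Toy frequency tables for two species (hypothetical values).
freqs = {
    "human": {"CGC": 90, "GCG": 80, "CCG": 60, "TTA": 2},
    "mouse": {"CGC": 70, "GCG": 85, "CCG": 40, "AAT": 5},
}
common = common_patterns(freqs)
ranks = {sp: rank_patterns(table, common) for sp, table in freqs.items()}
```
</preformat>
<p>Each entry of ranks is then a permutation of the same set of patterns, which is the input required by a rank correlation measure.</p>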
</sec>
<sec>
<title>Correlation between species</title>
<sec>
<title>k-mer/k-flank selection for correlation</title>
<p>Based on the common k-mers/k-flanks, the k-mers/k-flanks in each species were selected for correlation analysis between species. That is, for the correlation coefficient computation, each species has the same set of k-mers/k-flanks but a different rank order (<xref ref-type="fig" rid="gkt144-F2">Figure 2</xref>).</p>
</sec>
<sec>
<title>Kendall tau rank correlation coefficient with merge sort</title>
<p>Kendall's tau is a method to measure rank correlation, first discussed by G.T. Fechner in 1900 and rediscovered by M.G. Kendall in 1938 (
<xref ref-type="bibr" rid="gkt144-B23">23</xref>
,
<xref ref-type="bibr" rid="gkt144-B24">24</xref>
). It is a statistic used to measure the association between two measured quantities, effectively measuring rank correlation.</p>
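<p>As a sketch of how a merge sort yields Kendall's tau, the code below counts discordant pairs as inversions. It assumes two tie-free rankings of the same items, which holds when each species ranks the same common set of patterns. With D discordant pairs among n items, tau = 1 - 4D / (n(n - 1)). The function names are hypothetical.</p>
<preformat>
```python
def count_inversions(values):
    """Return (sorted copy, number of inversions) using merge sort."""
    if len(values) in (0, 1):
        return list(values), 0
    mid = len(values) // 2
    left, inv_left = count_inversions(values[:mid])
    right, inv_right = count_inversions(values[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while len(left) > i and len(right) > j:
        if right[j] >= left[i]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] precedes every remaining element of left: that many inversions.
            inversions += len(left) - i
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau(ranks_x, ranks_y):
    """Kendall's tau for two tie-free rankings given as dicts item -> rank.

    Requires at least two items shared by both rankings.
    """
    items = sorted(ranks_x, key=ranks_x.get)
    y_in_x_order = [ranks_y[item] for item in items]
    n = len(items)
    _, discordant = count_inversions(y_in_x_order)
    return 1.0 - 4.0 * discordant / (n * (n - 1))


tau = kendall_tau({"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 1, "c": 3})
```
</preformat>
<p>The merge sort reduces the cost from the quadratic pairwise comparison to O(n log n), which matters when thousands of longer k-mers are ranked.</p>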
</sec>
</sec>
</sec>
<sec>
<title>RESULTS AND DISCUSSION</title>
<sec>
<title>Reconstruction of the phylogenetic tree based on the distance matrix</title>
<p>The distance matrix can be obtained directly from the correlation coefficients computed by Kendall's tau method. If the correlation coefficient value is
<inline-formula>
<inline-graphic xlink:href="gkt144i5.jpg"></inline-graphic>
</inline-formula>
, the distance matrix will be (
<inline-formula>
<inline-graphic xlink:href="gkt144i6.jpg"></inline-graphic>
</inline-formula>
).</p>
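<p>The exact transformation is given by the inline formulas above and is not reproduced here. Purely for illustration, the sketch below assumes one common convention, in which a correlation tau is mapped to the distance 1 - tau. The function name and the toy correlation values are hypothetical.</p>
<preformat>
```python
def distance_matrix(tau, species):
    """Map a symmetric table of Kendall tau values to distances 1 - tau.

    The 1 - tau mapping is an assumption of this sketch, not the source formula.
    """
    return {
        a: {b: 0.0 if a == b else 1.0 - tau[a][b] for b in species}
        for a in species
    }


# Toy correlation values (hypothetical, not results from the study).
species = ["human", "chimp", "opossum"]
tau = {
    "human": {"chimp": 0.9, "opossum": 0.5},
    "chimp": {"human": 0.9, "opossum": 0.6},
    "opossum": {"human": 0.5, "chimp": 0.6},
}
dist = distance_matrix(tau, species)
```
</preformat>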
</sec>
<sec>
<title>Neighbour-joining algorithm</title>
<p>We applied a neighbour-joining algorithm (
<xref ref-type="bibr" rid="gkt144-B22">22</xref>
) to reconstruct phylogenetic trees from the matrix of pairwise evolutionary distances. Our objective here was to define the distance between genomes by using rank correlation tests of conserved k-mers.</p>
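<p>For readers unfamiliar with neighbour joining, a compact sketch of the standard algorithm is given below. It takes a symmetric distance table, repeatedly joins the pair that minimises the usual Q criterion, and returns a Newick string. It is a plain re-implementation for illustration, not the software used in this study, and the function name and toy distances are hypothetical.</p>
<preformat>
```python
def neighbor_joining(names, dist):
    """Join taxa with the standard Q criterion and return a Newick string.

    dist[a][b] is the distance between taxa a and b (symmetric).
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or best[0] > q:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and to b.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = "(%s:%.4f,%s:%.4f)" % (a, len_a, b, len_b)
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    return "(%s,%s:%.4f);" % (a, b, d[a][b])


# Toy additive distances for four taxa (hypothetical values).
toy = {
    "A": {"B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9},
}
newick = neighbor_joining(["A", "B", "C", "D"], toy)
```
</preformat>
<p>The resulting string can be passed to any viewer that accepts the Newick format.</p>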
</sec>
<sec>
<title>Visualize phylogenetic tree</title>
<p>We drew the trees from the Newick-format data using the online phylogeny drawing tool PHY-FI (
<xref ref-type="bibr" rid="gkt144-B25">25</xref>
). A phylogenetic tree built from the 64 top-ranked 3-mers is shown in
<xref ref-type="fig" rid="gkt144-F3">Figure 3</xref>
, which is consistent with the evolutionary history of the 10 mammalian genomes. Additional results using different k-mers and k-flanks are available at
<ext-link ext-link-type="uri" xlink:href="http://biohealth.snu.ac.kr/wiki/index.php/PhylogeneticTree">http://biohealth.snu.ac.kr/wiki/index.php/PhylogeneticTree</ext-link>.
<fig id="gkt144-F3" position="float">
<label>Figure 3.</label>
<caption>
<p>Phylogenetic tree using the 64 top-ranked 3-mers.</p>
</caption>
<graphic xlink:href="gkt144f3p"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Machine learning analysis of CpG island sequence using k-mer as features</title>
<p>In the previous section, the k-mer/k-flank rank method was used to reconstruct a phylogenetic tree based on their sequence pattern. To investigate further relationships among CpG island sequence patterns and evolutionary relationship between species, we performed machine learning analysis on the CpG island sequences using the same k-mer frequency approach.</p>
</sec>
<sec>
<title>Machine learning algorithms</title>
<p>We performed the machine learning analysis using representative algorithms including random forest (RF), naïve Bayes (NB), support vector machine (SVM) and radial basis function network (RBF) as implemented in the Weka package (
<xref ref-type="bibr" rid="gkt144-B26">26</xref>
). We did not use the artificial neural network (ANN) algorithm, as it is exceptionally time consuming. In the machine learning analysis, statistically significant k-mers were used as features, and CpG island sequences from each species were used as class.</p>
</sec>
<sec>
<title>Positive and negative data set</title>
<p>For a given k, the frequency of each k-mer in the CpG islands was counted and used as the positive data set to represent CpG island sequences. We used the Markov model (MM) in the seq++ package (
<xref ref-type="bibr" rid="gkt144-B27">27</xref>
) to generate the negative data set. From CpG island sequences, the Markov model parameters were estimated and the negative sequences were generated using the Markov model. We generated random sequences of the same length as original CpG island sequences to make the positive and negative data of the same size. The ratio between positive and negative data set was one to one. Using the positive and negative data sets, we performed a 10-fold cross validation to evaluate machine learning models to characterize CpG island sequences.</p>
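<p>The idea behind the negative set can be sketched with a plain order-m Markov chain. This is a simplified stand-in for illustration and does not reproduce the seq++ package. It assumes that each random sequence is seeded with the first m bases of the original sequence and that an unseen context falls back to a uniform draw. All names and toy sequences are hypothetical.</p>
<preformat>
```python
import random
from collections import Counter, defaultdict


def train_markov(sequences, order):
    """Count which base follows each context of `order` bases."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            transitions[seq[i - order:i]][seq[i]] += 1
    return transitions


def generate(transitions, start, length, rng):
    """Generate `length` bases, seeded with `start` (a context of `order` bases)."""
    out = list(start)
    order = len(start)
    while length > len(out):
        context = "".join(out[len(out) - order:])
        followers = transitions.get(context)
        if not followers:
            # Unseen context: fall back to a uniform draw (an assumption).
            out.append(rng.choice("ACGT"))
            continue
        bases = list(followers)
        out.append(rng.choices(bases, weights=[followers[b] for b in bases])[0])
    return "".join(out[:length])


islands = ["CGCGGCGCCGCGATCGCG", "GGCGCGCCGCGGCGC"]  # toy data
model = train_markov(islands, order=2)
rng = random.Random(0)
# One random sequence per island, with the same length, for a one-to-one ratio.
negatives = [generate(model, seq[:2], len(seq), rng) for seq in islands]
```
</preformat>
<p>Generating one random sequence per original sequence, with matching length, keeps the positive and negative sets the same size.</p>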
</sec>
<sec>
<title>Feature selections</title>
<p>To select statistically significant k-mers, t-tests were used and a
<italic>P</italic>
-value of 0.05 was used as a cut-off value. For the machine learning models, only statistically significant k-mers were used.</p>
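<p>The filtering step can be sketched as below. The text does not state which form of t-test was used or how the samples were defined, so the sketch assumes an unequal-variance (Welch) statistic computed on per-sequence frequencies, and it replaces the t distribution with a normal approximation, which is reasonable only for large samples. Function names and toy values are hypothetical.</p>
<preformat>
```python
from statistics import NormalDist, mean, variance


def approx_p_value(sample_a, sample_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(sample_a) / len(sample_a)
          + variance(sample_b) / len(sample_b)) ** 0.5
    if se == 0:
        return 1.0 if mean(sample_a) == mean(sample_b) else 0.0
    t = (mean(sample_a) - mean(sample_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def select_kmers(pos, neg, alpha=0.05):
    """Keep k-mers whose frequencies differ between the two sets at level alpha.

    pos and neg map each k-mer to a list of per-sequence frequencies.
    """
    return sorted(k for k in pos if alpha > approx_p_value(pos[k], neg[k]))


# Toy per-sequence frequencies (hypothetical values).
pos = {"CGCG": [5, 6, 7, 6, 5, 6], "ATAT": [1, 2, 1, 2, 1, 2]}
neg = {"CGCG": [2, 3, 2, 3, 2, 3], "ATAT": [2, 1, 2, 1, 2, 1]}
selected = select_kmers(pos, neg)
```
</preformat>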
</sec>
<sec>
<title>MM order and k-mer selection</title>
<p>Selecting the appropriate k-mer length and the order of the MM for generating random sequences is critically important. Thus, we investigated the number of k-mers whose frequencies differ significantly between the original CpG island sequences and random sequences generated by MMs of varying order.
<xref ref-type="table" rid="gkt144-T3">Table 3</xref>
shows the relationship between the orders of MM and k-mer lengths.
<table-wrap id="gkt144-T3" position="float">
<label>Table 3.</label>
<caption>
<p>Relation between MM order and k-mer length</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">MM-order</th>
<th rowspan="1" colspan="1">2-mer</th>
<th rowspan="1" colspan="1">3-mer</th>
<th rowspan="1" colspan="1">4-mer</th>
<th rowspan="1" colspan="1">5-mer</th>
<th rowspan="1" colspan="1">6-mer</th>
<th rowspan="1" colspan="1">7-mer</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">1st</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">50</td>
<td rowspan="1" colspan="1">202</td>
<td rowspan="1" colspan="1">830</td>
<td rowspan="1" colspan="1">2992</td>
<td rowspan="1" colspan="1">9570</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2nd</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">193</td>
<td rowspan="1" colspan="1">699</td>
<td rowspan="1" colspan="1">2444</td>
<td rowspan="1" colspan="1">7950</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3rd</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">456</td>
<td rowspan="1" colspan="1">1826</td>
<td rowspan="1" colspan="1">6120</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4th</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">947</td>
<td rowspan="1" colspan="1">4707</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5th</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">38</td>
<td rowspan="1" colspan="1">3520</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6th</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">230</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="gkt144-TF2">
<p>Number of significant k-mers selected by
<italic>t</italic>
-test with
<italic>P</italic>
&lt; 0.05.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec sec-type="results">
<title>Results of machine learning on k-mer as a human CpG island sequence feature</title>
<p>To investigate the predictive power of the machine learning models, we tested all possible combinations of MM order and k-mer length (
<xref ref-type="fig" rid="gkt144-F4 gkt144-F5 gkt144-F6 gkt144-F7 gkt144-F8 gkt144-F9 gkt144-F10 gkt144-F11">Figures 4–11</xref>
). Overall, SVM and RF performed better than NB and RBF, achieving prediction accuracies between 0.8 and 0.9. This result shows that CpG island sequences in humans contain distinctive k-mer patterns and are not random sequences.
<fig id="gkt144-F4" position="float">
<label>Figure 4.</label>
<caption>
<p>Machine learning performance with 4-mer as features using 2nd order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f4p"></graphic>
</fig>
<fig id="gkt144-F5" position="float">
<label>Figure 5.</label>
<caption>
<p>Machine learning performance with 5-mer as features using 2nd order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f5p"></graphic>
</fig>
<fig id="gkt144-F6" position="float">
<label>Figure 6.</label>
<caption>
<p>Machine learning performance with 6-mer as features using 2nd order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f6p"></graphic>
</fig>
<fig id="gkt144-F7" position="float">
<label>Figure 7.</label>
<caption>
<p>Machine learning performance with 5-mer as features using 3rd order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f7p"></graphic>
</fig>
<fig id="gkt144-F8" position="float">
<label>Figure 8.</label>
<caption>
<p>Machine learning performance with 6-mer as features using 3rd order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f8p"></graphic>
</fig>
<fig id="gkt144-F9" position="float">
<label>Figure 9.</label>
<caption>
<p>Machine learning performance with 6-mer as features using 4th order MM random set.</p>
</caption>
<graphic xlink:href="gkt144f9p"></graphic>
</fig>
<fig id="gkt144-F10" position="float">
<label>Figure 10.</label>
<caption>
<p>Accuracy change over different k-mer length with 2nd order MM as negative set.</p>
</caption>
<graphic xlink:href="gkt144f10p"></graphic>
</fig>
<fig id="gkt144-F11" position="float">
<label>Figure 11.</label>
<caption>
<p>Accuracy change over different k-mer length with different MM order using SVM.</p>
</caption>
<graphic xlink:href="gkt144f11p"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Analysis of CpG island sequences in nine mammalian genomes</title>
<p>We next extended the k-mer pattern analysis method used for the human to the other mammalian species (
<xref ref-type="table" rid="gkt144-T2">Table 2</xref>
). We fixed the machine learning algorithm to SVM because it showed the best performance in the previous analysis. We also fixed the parameters to 4-mers and a 2nd order MM for the negative data set, as 4 is the smallest k-mer length that gives meaningful results and an MM of at least 2nd order is required to simulate the dinucleotide characteristics of CpG sites in CpG island sequences.
<xref ref-type="fig" rid="gkt144-F12">Figure 12</xref>
shows the performance of the trained models for the other species. The overall performance was between 75 and 80%, indicating that the CpG island sequences of each species contain characteristic patterns.
<fig id="gkt144-F12" position="float">
<label>Figure 12.</label>
<caption>
<p>Machine learning k-mer analysis on each species.</p>
</caption>
<graphic xlink:href="gkt144f12p"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Analysis of CpG island sequences using common k-mer patterns between species</title>
<p>The previous machine learning analysis showed that the CpG island sequences of all species contain distinct, non-random k-mer patterns. To further investigate the k-mers, we analyzed the similarity of k-mer patterns between species. The number of k-mer patterns in each species is conserved (
<xref ref-type="table" rid="gkt144-T4">Table 4</xref>
); in addition, k-mer patterns were shared between species, with the shared percentage ranging from 67 to 100% (
<xref ref-type="table" rid="gkt144-T5">Table 5</xref>
).
<table-wrap id="gkt144-T4" position="float">
<label>Table 4.</label>
<caption>
<p>Number of k-mer patterns present in each species</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Chimp</th>
<th rowspan="1" colspan="1">Cow</th>
<th rowspan="1" colspan="1">Dog</th>
<th rowspan="1" colspan="1">Human</th>
<th rowspan="1" colspan="1">Marmoset</th>
<th rowspan="1" colspan="1">Mouse</th>
<th rowspan="1" colspan="1">Opossum</th>
<th rowspan="1" colspan="1">Pig</th>
<th rowspan="1" colspan="1">Rat</th>
<th rowspan="1" colspan="1">Rhesus</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Total (255)</td>
<td rowspan="1" colspan="1">215</td>
<td rowspan="1" colspan="1">204</td>
<td rowspan="1" colspan="1">179</td>
<td rowspan="1" colspan="1">222</td>
<td rowspan="1" colspan="1">211</td>
<td rowspan="1" colspan="1">215</td>
<td rowspan="1" colspan="1">200</td>
<td rowspan="1" colspan="1">182</td>
<td rowspan="1" colspan="1">216</td>
<td rowspan="1" colspan="1">204</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="gkt144-TF3">
<p>Of all possible 4-mer patterns (255), each species contains a different number of k-mer patterns (
<italic>P</italic>
&lt; 0.05).</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="gkt144-T5" position="float">
<label>Table 5.</label>
<caption>
<p>Number of common k-mers between species</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Species</th>
<th rowspan="1" colspan="1">Chimp</th>
<th rowspan="1" colspan="1">Cow</th>
<th rowspan="1" colspan="1">Dog</th>
<th rowspan="1" colspan="1">Human</th>
<th rowspan="1" colspan="1">Marmoset</th>
<th rowspan="1" colspan="1">Mouse</th>
<th rowspan="1" colspan="1">Opossum</th>
<th rowspan="1" colspan="1">Pig</th>
<th rowspan="1" colspan="1">Rat</th>
<th rowspan="1" colspan="1">Rhesus</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Chimp</td>
<td rowspan="1" colspan="1">215/100%</td>
<td rowspan="1" colspan="1">197/88.7%</td>
<td rowspan="1" colspan="1">169/75.1%</td>
<td rowspan="1" colspan="1">213/95.0%</td>
<td rowspan="1" colspan="1">196/85.2%</td>
<td rowspan="1" colspan="1">200/86.9%</td>
<td rowspan="1" colspan="1">182/78.1%</td>
<td rowspan="1" colspan="1">173/77.2%</td>
<td rowspan="1" colspan="1">196/83.4%</td>
<td rowspan="1" colspan="1">202/93.0%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Cow</td>
<td rowspan="1" colspan="1">197/88.7%</td>
<td rowspan="1" colspan="1">204/100%</td>
<td rowspan="1" colspan="1">166/76.4%</td>
<td rowspan="1" colspan="1">200/88.4%</td>
<td rowspan="1" colspan="1">184/79.6%</td>
<td rowspan="1" colspan="1">190/82.9%</td>
<td rowspan="1" colspan="1">181/81.1%</td>
<td rowspan="1" colspan="1">174/82.0%</td>
<td rowspan="1" colspan="1">188/81.0%</td>
<td rowspan="1" colspan="1">192/88.8%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Dog</td>
<td rowspan="1" colspan="1">169/75.1%</td>
<td rowspan="1" colspan="1">166/76.4%</td>
<td rowspan="1" colspan="1">179/100%</td>
<td rowspan="1" colspan="1">171/74.3%</td>
<td rowspan="1" colspan="1">159/68.8%</td>
<td rowspan="1" colspan="1">165/72.0%</td>
<td rowspan="1" colspan="1">153/67.6%</td>
<td rowspan="1" colspan="1">164/83.2%</td>
<td rowspan="1" colspan="1">162/69.5%</td>
<td rowspan="1" colspan="1">162/73.3%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Human</td>
<td rowspan="1" colspan="1">213/95.0%</td>
<td rowspan="1" colspan="1">200/88.4%</td>
<td rowspan="1" colspan="1">171/74.3%</td>
<td rowspan="1" colspan="1">222/100%</td>
<td rowspan="1" colspan="1">197/83.4%</td>
<td rowspan="1" colspan="1">204/87.5%</td>
<td rowspan="1" colspan="1">186/78.8%</td>
<td rowspan="1" colspan="1">175/76.4%</td>
<td rowspan="1" colspan="1">201/84.8%</td>
<td rowspan="1" colspan="1">203/91.0%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Marmoset</td>
<td rowspan="1" colspan="1">196/85.2%</td>
<td rowspan="1" colspan="1">184/79.6%</td>
<td rowspan="1" colspan="1">159/68.8%</td>
<td rowspan="1" colspan="1">197/83.4%</td>
<td rowspan="1" colspan="1">211/100%</td>
<td rowspan="1" colspan="1">190/80.5%</td>
<td rowspan="1" colspan="1">178/76.3%</td>
<td rowspan="1" colspan="1">165/72.3%</td>
<td rowspan="1" colspan="1">191/80.9%</td>
<td rowspan="1" colspan="1">190/84.4%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mouse</td>
<td rowspan="1" colspan="1">200/86.9%</td>
<td rowspan="1" colspan="1">190/82.9%</td>
<td rowspan="1" colspan="1">165/72.0%</td>
<td rowspan="1" colspan="1">204/87.5%</td>
<td rowspan="1" colspan="1">190/80.5%</td>
<td rowspan="1" colspan="1">215/100%</td>
<td rowspan="1" colspan="1">175/72.9%</td>
<td rowspan="1" colspan="1">168/73.3%</td>
<td rowspan="1" colspan="1">201/87.3%</td>
<td rowspan="1" colspan="1">191/83.7%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Opossum</td>
<td rowspan="1" colspan="1">182/78.1%</td>
<td rowspan="1" colspan="1">181/81.1%</td>
<td rowspan="1" colspan="1">153/67.6%</td>
<td rowspan="1" colspan="1">186/78.8%</td>
<td rowspan="1" colspan="1">178/76.3%</td>
<td rowspan="1" colspan="1">175/72.9%</td>
<td rowspan="1" colspan="1">200/100%</td>
<td rowspan="1" colspan="1">158/70.5%</td>
<td rowspan="1" colspan="1">178/74.7%</td>
<td rowspan="1" colspan="1">176/77.1%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Pig</td>
<td rowspan="1" colspan="1">173/77.2%</td>
<td rowspan="1" colspan="1">174/82.0%</td>
<td rowspan="1" colspan="1">164/83.2%</td>
<td rowspan="1" colspan="1">175/76.4%</td>
<td rowspan="1" colspan="1">165/72.3%</td>
<td rowspan="1" colspan="1">168/73.3%</td>
<td rowspan="1" colspan="1">158/70.5%</td>
<td rowspan="1" colspan="1">182/100%</td>
<td rowspan="1" colspan="1">168/73.0%</td>
<td rowspan="1" colspan="1">170/78.7%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rat</td>
<td rowspan="1" colspan="1">196/83.4%</td>
<td rowspan="1" colspan="1">188/81.0%</td>
<td rowspan="1" colspan="1">162/69.5%</td>
<td rowspan="1" colspan="1">201/84.8%</td>
<td rowspan="1" colspan="1">191/80.9%</td>
<td rowspan="1" colspan="1">201/87.3%</td>
<td rowspan="1" colspan="1">178/74.7%</td>
<td rowspan="1" colspan="1">168/73.0%</td>
<td rowspan="1" colspan="1">216/100%</td>
<td rowspan="1" colspan="1">189/81.8%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Rhesus</td>
<td rowspan="1" colspan="1">202/93.0%</td>
<td rowspan="1" colspan="1">192/88.8%</td>
<td rowspan="1" colspan="1">162/73.3%</td>
<td rowspan="1" colspan="1">203/91.0%</td>
<td rowspan="1" colspan="1">190/84.4%</td>
<td rowspan="1" colspan="1">191/83.7%</td>
<td rowspan="1" colspan="1">176/77.1%</td>
<td rowspan="1" colspan="1">170/78.7%</td>
<td rowspan="1" colspan="1">189/81.8%</td>
<td rowspan="1" colspan="1">204/100%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="gkt144-TF4">
<p>(Number of common k-mers)/(Percentage of common k-mers). Evolutionarily closer genome pairs retain a higher percentage of common k-mer patterns.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
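<p>The percentages in Table 5 are not defined in the text. They appear consistent with the number of shared patterns divided by the size of the union of the two pattern sets; for chimp and cow, 197 / (215 + 204 - 197) is approximately 88.7%. The sketch below makes that reading explicit. It is an inference from the reported numbers rather than a stated definition, and the function name is hypothetical.</p>
<preformat>
```python
def shared_fraction(patterns_a, patterns_b):
    """Return (number of shared patterns, shared / union as a percentage)."""
    a, b = set(patterns_a), set(patterns_b)
    shared = len(a.intersection(b))
    union = len(a.union(b))
    return shared, 100.0 * shared / union


# Toy pattern sets (hypothetical).
count, percent = shared_fraction({"CGCG", "GCGC", "CCGG"}, {"CGCG", "GCGC", "AATT"})

# Check of the chimp/cow cell using only the counts reported in Tables 4 and 5.
chimp_total, cow_total, chimp_cow_shared = 215, 204, 197
chimp_cow_percent = 100.0 * chimp_cow_shared / (chimp_total + cow_total - chimp_cow_shared)
```
</preformat>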
</sec>
<sec>
<title>Comparison between species I: applying a human model to other species</title>
<p>To compare CpG island sequences among species, we used a human model of 4-mer and the 2nd order MM background model to classify CpG island sequences in nine mammalian genomes. In this case, the human CpG island sequences were used as the training data, and CpG island sequences in other species served as the test data. The experimental procedure is summarized in
<xref ref-type="fig" rid="gkt144-F13">Figure 13</xref>
. We expected the predictive power of the human model to decrease as the evolutionary distance between species increased. As shown in
<xref ref-type="fig" rid="gkt144-F14">Figure 14</xref>
, the results are consistent with evolutionary history.
<fig id="gkt144-F13" position="float">
<label>Figure 13.</label>
<caption>
<p>Applying human CpG island model to different species.</p>
</caption>
<graphic xlink:href="gkt144f13p"></graphic>
</fig>
<fig id="gkt144-F14" position="float">
<label>Figure 14.</label>
<caption>
<p>Result of applying a human model to different species.</p>
</caption>
<graphic xlink:href="gkt144f14p"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Comparison between species II: models between human and other species</title>
<p>To further compare CpG island sequences among species, we used the human CpG island sequences as the positive data set and the sequences of another species as the negative data set and performed 10-fold cross validation experiments.
<xref ref-type="fig" rid="gkt144-F15">Figure 15</xref>
illustrates the experimental scheme. The result in
<xref ref-type="fig" rid="gkt144-F16">Figure 16</xref>
is consistent with evolutionary history: prediction accuracy was low for closely related species (e.g. human versus chimp) and high for distant species (e.g. human versus opossum).
<fig id="gkt144-F15" position="float">
<label>Figure 15.</label>
<caption>
<p>Machine learning analysis with human as the positive data set and other species as the negative data set.</p>
</caption>
<graphic xlink:href="gkt144f15p"></graphic>
</fig>
<fig id="gkt144-F16" position="float">
<label>Figure 16.</label>
<caption>
<p>Result of the machine learning analysis with human as the positive data set and other species as the negative data set. The result is consistent with evolutionary history: when two species are close (e.g. human versus chimp), the prediction accuracy is low, whereas when two species are distant (e.g. human versus opossum), the prediction accuracy is high.</p>
</caption>
<graphic xlink:href="gkt144f16p"></graphic>
</fig>
</p>
</sec>
</sec>
<sec sec-type="conclusions">
<title>CONCLUSION</title>
<p>CpG island sequences play critical roles in development and disease biology. Despite a number of important analytical studies on CpG island sequence characteristics, no comparative analysis of CpG island sequences among different species exists. One possible reason is that conventional sequence analysis techniques are ineffective for analyzing the highly biased character composition of CpG island sequences. In this article, we proposed new approaches using exact patterns in CpG island sequences, called k-mer and k-flank. By using a genome distance based on rank correlation tests, we show that k-mer and k-flank patterns near CpG sites can correctly reconstruct the phylogeny of 10 mammalian genomes. We further report, using various machine learning algorithms, that k-mers can be used to characterize CpG island sequences. Conserved k-mers imply conservation of short sequences within CpG island sequences. Thus, our finding of conserved k-mers in CpG island sequences extends the current view of CpG islands from CpG over-represented sequences to partially conserved sequences. In addition, testing the human model on nine additional mammalian genomes confirms that k-mers are signatures consistent with evolutionary history. We conclude, for the first time, that the CpG island sequences of 10 mammalian genomes contain evolutionary evidence in the form of non-random pattern characteristics.</p>
</sec>
<sec>
<title>FUNDING</title>
<p>
<funding-source>Next-Generation Information Computing Development Program</funding-source>
through the
<funding-source>National Research Foundation of Korea (NRF)</funding-source>
funded by the
<funding-source>Ministry of Education, Science and Technology</funding-source>
[
<award-id>2012M3C4A7033341</award-id>
];
<funding-source>Next-Generation BioGreen 21 Program</funding-source>
[
<award-id>PJ009037022012</award-id>
];
<funding-source>Rural Development Administration, Republic of Korea</funding-source>
;
<funding-source>World Class University Program through the National Research Foundation of Korea</funding-source>
funded by the
<funding-source>Ministry of Education, Science, and Technology</funding-source>
, under Grant
<award-id>R31-10008</award-id>
;
<funding-source>National Institutes of Health</funding-source>
[
<award-id>CA113001</award-id>
]. Funding for open access charge:
<funding-source>Seoul National University</funding-source>
.</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>The authors thank Craig Jackson at Indiana University for his earlier work on the rank-based analysis of CpG island sequences.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="gkt144-B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jabbari</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Bernardi</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Cytosine methylation and CpG, TpG (CpA) and TpA frequencies</article-title>
<source>Gene</source>
<year>2004</year>
<volume>333</volume>
<fpage>143</fpage>
<lpage>149</lpage>
<pub-id pub-id-type="pmid">15177689</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jones</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>Functions of DNA methylation: islands, start sites, gene bodies and beyond</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>484</fpage>
<lpage>492</lpage>
<pub-id pub-id-type="pmid">22641018</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bell</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>GA</given-names>
</name>
<name>
<surname>Butcher</surname>
<given-names>LM</given-names>
</name>
<name>
<surname>Roos</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Beck</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Human-specific CpG “beacons” identify loci associated with human-specific traits and disease</article-title>
<source>Epigenetics</source>
<year>2012</year>
<volume>7</volume>
<fpage>1188</fpage>
<lpage>1199</lpage>
<pub-id pub-id-type="pmid">22968434</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Portela</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Esteller</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Epigenetic modifications and human disease [Review]</article-title>
<source>Nat. Biotechnol.</source>
<year>2010</year>
<volume>28</volume>
<fpage>1057</fpage>
<lpage>1068</lpage>
<pub-id pub-id-type="pmid">20944598</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feinberg</surname>
<given-names>AP</given-names>
</name>
</person-group>
<article-title>Epigenomics reveals a functional genome anatomy and a new approach to common disease</article-title>
<source>Nat. Biotechnol.</source>
<year>2010</year>
<volume>28</volume>
<fpage>1049</fpage>
<lpage>1052</lpage>
<pub-id pub-id-type="pmid">20944596</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burge</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Karlin</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Over- and under-representation of short oligonucleotides in DNA sequences</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1992</year>
<volume>89</volume>
<fpage>1358</fpage>
<lpage>1362</lpage>
<pub-id pub-id-type="pmid">1741388</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scarano</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Iaccarino</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grippo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Parisi</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>The heterogeneity of thymine methyl group origin in DNA pyrimidine isostichs of developing sea urchin embryos</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1967</year>
<volume>57</volume>
<fpage>1394</fpage>
<lpage>1400</lpage>
<pub-id pub-id-type="pmid">5231746</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Paulsen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tierling</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mikeska</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure</article-title>
<source>PLoS Genet.</source>
<year>2006</year>
<volume>2</volume>
<fpage>e26</fpage>
<pub-id pub-id-type="pmid">16520826</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fatemi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pao</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Jeong</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gal-Yam</surname>
<given-names>EN</given-names>
</name>
<name>
<surname>Egger</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Weisenberger</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>Footprinting of mammalian promoters: use of a CpG DNA methyltransferase revealing nucleosome positions at a single molecule level</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>e176</fpage>
<pub-id pub-id-type="pmid">16314307</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Larsen</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>CpG islands as gene markers in the human genome</article-title>
<source>Genomics</source>
<year>1992</year>
<volume>13</volume>
<fpage>1095</fpage>
<lpage>1107</lpage>
<pub-id pub-id-type="pmid">1505946</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saxonov</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Berg</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
</person-group>
<article-title>A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2006</year>
<volume>103</volume>
<fpage>1412</fpage>
<lpage>1417</lpage>
<pub-id pub-id-type="pmid">16432200</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Antequera</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Structure, function and evolution of CpG island promoters</article-title>
<source>Cell Mol. Life Sci.</source>
<year>2003</year>
<volume>60</volume>
<fpage>1647</fpage>
<lpage>1658</lpage>
<pub-id pub-id-type="pmid">14504655</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sharif</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Endo</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Toyoda</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Koseki</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Divergence of CpG island promoters: a consequence or cause of evolution?</article-title>
<source>Dev. Growth Differ.</source>
<year>2010</year>
<volume>52</volume>
<fpage>545</fpage>
<lpage>554</lpage>
</element-citation>
</ref>
<ref id="gkt144-B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Masson</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rosati</surname>
<given-names>B</given-names>
</name>
<name>
<surname>McKinnon</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Evolution of CpG island promoter function underlies changes in KChIP2 potassium channel subunit gene expression in mammalian heart</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2012</year>
<volume>109</volume>
<fpage>1601</fpage>
<lpage>1606</lpage>
<pub-id pub-id-type="pmid">22307618</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gardiner-Garden</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Frommer</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>CpG islands in vertebrate genomes</article-title>
<source>J. Mol. Biol.</source>
<year>1987</year>
<volume>196</volume>
<fpage>261</fpage>
<lpage>282</lpage>
<pub-id pub-id-type="pmid">3656447</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Takai</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>Comprehensive analysis of CpG islands in human chromosomes 21 and 22</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2002</year>
<volume>99</volume>
<fpage>3740</fpage>
<lpage>3745</lpage>
<pub-id pub-id-type="pmid">11891299</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Analysing and interpreting DNA methylation data</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>705</fpage>
<lpage>719</lpage>
<pub-id pub-id-type="pmid">22986265</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Caffo</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Jaffee</surname>
<given-names>HA</given-names>
</name>
<name>
<surname>Irizarry</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Feinberg</surname>
<given-names>AP</given-names>
</name>
</person-group>
<article-title>Redefining CpG islands using hidden Markov models</article-title>
<source>Biostatistics</source>
<year>2010</year>
<volume>11</volume>
<fpage>499</fpage>
<lpage>514</lpage>
<pub-id pub-id-type="pmid">20212320</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feuerbach</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Halachev</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Assenov</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Müller</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Analyzing epigenome data in context of genome evolution and human diseases</article-title>
<source>Methods Mol. Biol.</source>
<year>2012</year>
<volume>856</volume>
<fpage>431</fpage>
<lpage>467</lpage>
<pub-id pub-id-type="pmid">22399470</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cohen</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Kenigsberg</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tanay</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Primate CpG islands are maintained by heterogeneous evolutionary regimes involving minimal selection</article-title>
<source>Cell</source>
<year>2011</year>
<volume>145</volume>
<fpage>773</fpage>
<lpage>786</lpage>
<pub-id pub-id-type="pmid">21620139</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nussinov</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Compositional variations in DNA sequences</article-title>
<source>Comput. Appl. Biosci.</source>
<year>1991</year>
<volume>7</volume>
<fpage>287</fpage>
<lpage>293</lpage>
<pub-id pub-id-type="pmid">1913208</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saitou</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Nei</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The neighbor-joining method: a new method for reconstructing phylogenetic trees</article-title>
<source>Mol. Biol. Evol.</source>
<year>1987</year>
<volume>4</volume>
<fpage>406</fpage>
<lpage>425</lpage>
<pub-id pub-id-type="pmid">3447015</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kendall</surname>
<given-names>MG</given-names>
</name>
</person-group>
<article-title>A new measure of rank correlation</article-title>
<source>Biometrika</source>
<year>1938</year>
<volume>30</volume>
<fpage>81</fpage>
<lpage>93</lpage>
</element-citation>
</ref>
<ref id="gkt144-B24">
<label>24</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kendall</surname>
<given-names>MG</given-names>
</name>
</person-group>
<source>Rank Correlation Methods</source>
<year>1970</year>
<comment>4th edn. Charles Griffin, London</comment>
</element-citation>
</ref>
<ref id="gkt144-B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fredslund</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>PHY-FI: fast and easy online creation and manipulation of phylogeny color figures</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>315</fpage>
<pub-id pub-id-type="pmid">16792795</pub-id>
</element-citation>
</ref>
<ref id="gkt144-B26">
<label>26</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
</person-group>
<source>Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations</source>
<year>2000</year>
<publisher-loc>San Francisco, CA</publisher-loc>
<publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>
</element-citation>
</ref>
<ref id="gkt144-B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miele</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Bourguignon</surname>
<given-names>PY</given-names>
</name>
<name>
<surname>Robelin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Nuel</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Richard</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>seq++: analyzing biological sequences with a range of Markov-related models</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>2783</fpage>
<lpage>2784</lpage>
<pub-id pub-id-type="pmid">15774554</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F51 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F51 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3643570
   |texte=   Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23519616" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021