MERS exploration server

A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters

Internal identifier: 000A85 (Pmc/Corpus); previous: 000A84; next: 000A86

Authors: Youngik Yang; Kenneth Nephew; Sun Kim

RBID : PMC:3311103

Abstract

Background

DNA methylation is essential for normal development and differentiation and plays a crucial role in the development of nearly all types of cancer. Aberrant DNA methylation patterns, including genome-wide hypomethylation and region-specific hypermethylation, are frequently observed and contribute to the malignant phenotype. A number of studies have recently identified distinct features of genomic sequences that can be used for modeling specific DNA sequences that may be susceptible to aberrant CpG methylation in both cancer and normal cells. Although it is now possible, using next generation sequencing technologies, to assess human methylomes at base resolution, no reports currently exist on modeling cell type-specific DNA methylation susceptibility. Thus, we conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility at three different resolutions: CpG dinucleotides, CpG segments, and individual gene promoter regions.

Results

Using a k-mer mixture logistic regression model, we effectively modeled DNA methylation susceptibility across five different cell types. Further, at the segment level, we achieved an AUC of up to 0.75 in a 10-fold cross-validation study using a mixture of k-mers.

Conclusions

The significance of these results is threefold: 1) this is the first report to indicate that CpG methylation-susceptible "segments" exist; 2) our model demonstrates the significance of certain k-mers for the mixture model, potentially highlighting DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences across different tissue types; 3) as only 3 or 4 bp patterns had previously been used for modeling DNA methylation susceptibility, ours is the first demonstration that 6-mer modeling can be performed without loss of accuracy.
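
To make the notion of k-mer features more concrete, the following is a minimal sketch of how a DNA sequence could be turned into a feature vector that mixes several k-mer lengths. It is an illustration only, not the authors' implementation: the function names, the concatenation of count vectors, and the default choice of k values (3, 4 and 6, taken from the pattern lengths mentioned above) are all assumptions made here, and the logistic regression step itself is not shown.

```python
from collections import Counter
from itertools import product


def cpg_positions(seq):
    """Return the 0-based start index of every CpG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping any k-mer with a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in "ACGT" for base in kmer):
            counts[kmer] += 1
    return counts


def mixed_kmer_features(seq, ks=(3, 4, 6)):
    """Concatenate k-mer count vectors for several k values (hypothetical scheme).

    For each k, every possible k-mer over A/C/G/T gets one slot, in
    lexicographic order, so the vector length is the sum of 4**k over ks.
    """
    features = []
    for k in ks:
        counts = kmer_counts(seq, k)
        for kmer in map("".join, product("ACGT", repeat=k)):
            features.append(counts[kmer])
    return features


if __name__ == "__main__":
    segment = "TTACGCGGACGT"  # made-up example sequence
    print(cpg_positions(segment))
    print(len(mixed_kmer_features(segment)))
```

A vector built this way could be passed to any standard logistic regression routine as the predictor for a methylated/unmethylated label; how the original study weights or selects k-mers within its mixture model is not specified in this section.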


DOI: 10.1186/1471-2105-13-S3-S15
PubMed: 22536899
PubMed Central: 3311103

Links to Exploration step

PMC:3311103

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters</title>
<author>
<name sortKey="Yang, Youngik" sort="Yang, Youngik" uniqKey="Yang Y" first="Youngik" last="Yang">Youngik Yang</name>
<affiliation>
<nlm:aff id="I1">J Craig Venter Institute, San Diego, CA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nephew, Kenneth" sort="Nephew, Kenneth" uniqKey="Nephew K" first="Kenneth" last="Nephew">Kenneth Nephew</name>
<affiliation>
<nlm:aff id="I2">Medical Sciences Program, Indiana University School of Medicine, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kim, Sun" sort="Kim, Sun" uniqKey="Kim S" first="Sun" last="Kim">Sun Kim</name>
<affiliation>
<nlm:aff id="I3">School of Computer Science and Engineering, Bioinformatics Institute, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22536899</idno>
<idno type="pmc">3311103</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3311103</idno>
<idno type="RBID">PMC:3311103</idno>
<idno type="doi">10.1186/1471-2105-13-S3-S15</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000A85</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000A85</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters</title>
<author>
<name sortKey="Yang, Youngik" sort="Yang, Youngik" uniqKey="Yang Y" first="Youngik" last="Yang">Youngik Yang</name>
<affiliation>
<nlm:aff id="I1">J Craig Venter Institute, San Diego, CA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Nephew, Kenneth" sort="Nephew, Kenneth" uniqKey="Nephew K" first="Kenneth" last="Nephew">Kenneth Nephew</name>
<affiliation>
<nlm:aff id="I2">Medical Sciences Program, Indiana University School of Medicine, Bloomington, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kim, Sun" sort="Kim, Sun" uniqKey="Kim S" first="Sun" last="Kim">Sun Kim</name>
<affiliation>
<nlm:aff id="I3">School of Computer Science and Engineering, Bioinformatics Institute, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>DNA methylation is essential for normal development and differentiation and plays a crucial role in the development of nearly all types of cancer. Aberrant DNA methylation patterns, including genome-wide hypomethylation and region-specific hypermethylation, are frequently observed and contribute to the malignant phenotype. A number of studies have recently identified distinct features of genomic sequences that can be used for modeling specific DNA sequences that may be susceptible to aberrant CpG methylation in both cancer and normal cells. Although it is now possible, using next generation sequencing technologies, to assess human methylomes at base resolution, no reports currently exist on modeling cell type-specific DNA methylation susceptibility. Thus, we conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility at three different resolutions: CpG dinucleotides, CpG segments, and individual gene promoter regions.</p>
</sec>
<sec>
<title>Results</title>
<p>Using a k-mer mixture logistic regression model, we effectively modeled DNA methylation susceptibility across five different cell types. Further, at the segment level, we achieved an AUC of up to 0.75 in a 10-fold cross-validation study using a mixture of k-mers.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The significance of these results is threefold: 1) this is the first report to indicate that CpG methylation-susceptible "segments" exist; 2) our model demonstrates the significance of certain k-mers for the mixture model, potentially highlighting DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences across different tissue types; 3) as only 3 or 4 bp patterns had previously been used for modeling DNA methylation susceptibility, ours is the first demonstration that 6-mer modeling can be performed without loss of accuracy.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Bird, A" uniqKey="Bird A">A Bird</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, Pa" uniqKey="Jones P">PA Jones</name>
</author>
<author>
<name sortKey="Laird, Pw" uniqKey="Laird P">PW Laird</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ting, Ah" uniqKey="Ting A">AH Ting</name>
</author>
<author>
<name sortKey="Mcgarvey, Km" uniqKey="Mcgarvey K">KM McGarvey</name>
</author>
<author>
<name sortKey="Baylin, Sb" uniqKey="Baylin S">SB Baylin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Herman, Jg" uniqKey="Herman J">JG Herman</name>
</author>
<author>
<name sortKey="Baylin, Sb" uniqKey="Baylin S">SB Baylin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Costello, Jf" uniqKey="Costello J">JF Costello</name>
</author>
<author>
<name sortKey="Fruhwald, Mc" uniqKey="Fruhwald M">MC Frühwald</name>
</author>
<author>
<name sortKey="Smiraglia, Dj" uniqKey="Smiraglia D">DJ Smiraglia</name>
</author>
<author>
<name sortKey="Rush, Lj" uniqKey="Rush L">LJ Rush</name>
</author>
<author>
<name sortKey="Robertson, Gp" uniqKey="Robertson G">GP Robertson</name>
</author>
<author>
<name sortKey="Gao, X" uniqKey="Gao X">X Gao</name>
</author>
<author>
<name sortKey="Wright, Fa" uniqKey="Wright F">FA Wright</name>
</author>
<author>
<name sortKey="Feramisco, Jd" uniqKey="Feramisco J">JD Feramisco</name>
</author>
<author>
<name sortKey="Peltom Ki, P" uniqKey="Peltom Ki P">P Peltomäki</name>
</author>
<author>
<name sortKey="Lang, Jc" uniqKey="Lang J">JC Lang</name>
</author>
<author>
<name sortKey="Schuller, De" uniqKey="Schuller D">DE Schuller</name>
</author>
<author>
<name sortKey="Yu, L" uniqKey="Yu L">L Yu</name>
</author>
<author>
<name sortKey="Bloomfield, Cd" uniqKey="Bloomfield C">CD Bloomfield</name>
</author>
<author>
<name sortKey="Caligiuri, Ma" uniqKey="Caligiuri M">MA Caligiuri</name>
</author>
<author>
<name sortKey="Yates, A" uniqKey="Yates A">A Yates</name>
</author>
<author>
<name sortKey="Nishikawa, R" uniqKey="Nishikawa R">R Nishikawa</name>
</author>
<author>
<name sortKey="Su Huang, H" uniqKey="Su Huang H">H Su Huang</name>
</author>
<author>
<name sortKey="Petrelli, Nj" uniqKey="Petrelli N">NJ Petrelli</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="O Dorisio, Ms" uniqKey="O Dorisio M">MS O'Dorisio</name>
</author>
<author>
<name sortKey="Held, Wa" uniqKey="Held W">WA Held</name>
</author>
<author>
<name sortKey="Cavenee, Wk" uniqKey="Cavenee W">WK Cavenee</name>
</author>
<author>
<name sortKey="Plass, C" uniqKey="Plass C">C Plass</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Laird, Pw" uniqKey="Laird P">PW Laird</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feltus, Fa" uniqKey="Feltus F">FA Feltus</name>
</author>
<author>
<name sortKey="Lee, Ek" uniqKey="Lee E">EK Lee</name>
</author>
<author>
<name sortKey="Costello, Jf" uniqKey="Costello J">JF Costello</name>
</author>
<author>
<name sortKey="Plass, C" uniqKey="Plass C">C Plass</name>
</author>
<author>
<name sortKey="Vertino, Pm" uniqKey="Vertino P">PM Vertino</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Prufer, K" uniqKey="Prufer K">K Prüfer</name>
</author>
<author>
<name sortKey="Stenzel, U" uniqKey="Stenzel U">U Stenzel</name>
</author>
<author>
<name sortKey="Dannemann, M" uniqKey="Dannemann M">M Dannemann</name>
</author>
<author>
<name sortKey="Green, Re" uniqKey="Green R">RE Green</name>
</author>
<author>
<name sortKey="Lachmann, M" uniqKey="Lachmann M">M Lachmann</name>
</author>
<author>
<name sortKey="Kelso, J" uniqKey="Kelso J">J Kelso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mccabe, Mt" uniqKey="Mccabe M">MT McCabe</name>
</author>
<author>
<name sortKey="Lee, Ek" uniqKey="Lee E">EK Lee</name>
</author>
<author>
<name sortKey="Vertino, Pm" uniqKey="Vertino P">PM Vertino</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feltus, Fa" uniqKey="Feltus F">FA Feltus</name>
</author>
<author>
<name sortKey="Lee, Ek" uniqKey="Lee E">EK Lee</name>
</author>
<author>
<name sortKey="Costello, Jf" uniqKey="Costello J">JF Costello</name>
</author>
<author>
<name sortKey="Plass, C" uniqKey="Plass C">C Plass</name>
</author>
<author>
<name sortKey="Vertino, Pm" uniqKey="Vertino P">PM Vertino</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Keshet, I" uniqKey="Keshet I">I Keshet</name>
</author>
<author>
<name sortKey="Schlesinger, Y" uniqKey="Schlesinger Y">Y Schlesinger</name>
</author>
<author>
<name sortKey="Farkash, S" uniqKey="Farkash S">S Farkash</name>
</author>
<author>
<name sortKey="Rand, E" uniqKey="Rand E">E Rand</name>
</author>
<author>
<name sortKey="Hecht, M" uniqKey="Hecht M">M Hecht</name>
</author>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Pikarski, E" uniqKey="Pikarski E">E Pikarski</name>
</author>
<author>
<name sortKey="Young, Ra" uniqKey="Young R">RA Young</name>
</author>
<author>
<name sortKey="Niveleau, A" uniqKey="Niveleau A">A Niveleau</name>
</author>
<author>
<name sortKey="Cedar, H" uniqKey="Cedar H">H Cedar</name>
</author>
<author>
<name sortKey="Simon, I" uniqKey="Simon I">I Simon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goh, L" uniqKey="Goh L">L Goh</name>
</author>
<author>
<name sortKey="Murphy, Sk" uniqKey="Murphy S">SK Murphy</name>
</author>
<author>
<name sortKey="Muhkerjee, S" uniqKey="Muhkerjee S">S Muhkerjee</name>
</author>
<author>
<name sortKey="Furey, Ts" uniqKey="Furey T">TS Furey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fang, F" uniqKey="Fang F">F Fang</name>
</author>
<author>
<name sortKey="Fan, S" uniqKey="Fan S">S Fan</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Zhang, Mq" uniqKey="Zhang M">MQ Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
<author>
<name sortKey="Paulsen, M" uniqKey="Paulsen M">M Paulsen</name>
</author>
<author>
<name sortKey="Tierling, S" uniqKey="Tierling S">S Tierling</name>
</author>
<author>
<name sortKey="Mikeska, T" uniqKey="Mikeska T">T Mikeska</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
<author>
<name sortKey="Walter, J" uniqKey="Walter J">J Walter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Handa, V" uniqKey="Handa V">V Handa</name>
</author>
<author>
<name sortKey="Jeltsch, A" uniqKey="Jeltsch A">A Jeltsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Rohde, C" uniqKey="Rohde C">C Rohde</name>
</author>
<author>
<name sortKey="Tierling, S" uniqKey="Tierling S">S Tierling</name>
</author>
<author>
<name sortKey="Jurkowski, Tp" uniqKey="Jurkowski T">TP Jurkowski</name>
</author>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
<author>
<name sortKey="Santacruz, D" uniqKey="Santacruz D">D Santacruz</name>
</author>
<author>
<name sortKey="Ragozin, S" uniqKey="Ragozin S">S Ragozin</name>
</author>
<author>
<name sortKey="Reinhardt, R" uniqKey="Reinhardt R">R Reinhardt</name>
</author>
<author>
<name sortKey="Groth, M" uniqKey="Groth M">M Groth</name>
</author>
<author>
<name sortKey="Walter, J" uniqKey="Walter J">J Walter</name>
</author>
<author>
<name sortKey="Jeltsch, A" uniqKey="Jeltsch A">A Jeltsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brunner, Al" uniqKey="Brunner A">AL Brunner</name>
</author>
<author>
<name sortKey="Johnson, Ds" uniqKey="Johnson D">DS Johnson</name>
</author>
<author>
<name sortKey="Kim, Sw" uniqKey="Kim S">SW Kim</name>
</author>
<author>
<name sortKey="Valouev, A" uniqKey="Valouev A">A Valouev</name>
</author>
<author>
<name sortKey="Reddy, Te" uniqKey="Reddy T">TE Reddy</name>
</author>
<author>
<name sortKey="Neff, Nf" uniqKey="Neff N">NF Neff</name>
</author>
<author>
<name sortKey="Anton, E" uniqKey="Anton E">E Anton</name>
</author>
<author>
<name sortKey="Medina, C" uniqKey="Medina C">C Medina</name>
</author>
<author>
<name sortKey="Nguyen, L" uniqKey="Nguyen L">L Nguyen</name>
</author>
<author>
<name sortKey="Chiao, E" uniqKey="Chiao E">E Chiao</name>
</author>
<author>
<name sortKey="Oyolu, Cb" uniqKey="Oyolu C">CB Oyolu</name>
</author>
<author>
<name sortKey="Schroth, Gp" uniqKey="Schroth G">GP Schroth</name>
</author>
<author>
<name sortKey="Absher, Dm" uniqKey="Absher D">DM Absher</name>
</author>
<author>
<name sortKey="Baker, Jc" uniqKey="Baker J">JC Baker</name>
</author>
<author>
<name sortKey="Myers, Rm" uniqKey="Myers R">RM Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lister, R" uniqKey="Lister R">R Lister</name>
</author>
<author>
<name sortKey="Pelizzola, M" uniqKey="Pelizzola M">M Pelizzola</name>
</author>
<author>
<name sortKey="Dowen, Rh" uniqKey="Dowen R">RH Dowen</name>
</author>
<author>
<name sortKey="Hawkins, Rd" uniqKey="Hawkins R">RD Hawkins</name>
</author>
<author>
<name sortKey="Hon, G" uniqKey="Hon G">G Hon</name>
</author>
<author>
<name sortKey="Tonti Filippini, J" uniqKey="Tonti Filippini J">J Tonti-Filippini</name>
</author>
<author>
<name sortKey="Nery, Jr" uniqKey="Nery J">JR Nery</name>
</author>
<author>
<name sortKey="Lee, L" uniqKey="Lee L">L Lee</name>
</author>
<author>
<name sortKey="Ye, Z" uniqKey="Ye Z">Z Ye</name>
</author>
<author>
<name sortKey="Ngo, Qm" uniqKey="Ngo Q">QM Ngo</name>
</author>
<author>
<name sortKey="Edsall, L" uniqKey="Edsall L">L Edsall</name>
</author>
<author>
<name sortKey="Antosiewicz Bourget, J" uniqKey="Antosiewicz Bourget J">J Antosiewicz-Bourget</name>
</author>
<author>
<name sortKey="Stewart, R" uniqKey="Stewart R">R Stewart</name>
</author>
<author>
<name sortKey="Ruotti, V" uniqKey="Ruotti V">V Ruotti</name>
</author>
<author>
<name sortKey="Millar, Ah" uniqKey="Millar A">AH Millar</name>
</author>
<author>
<name sortKey="Thomson, Ja" uniqKey="Thomson J">JA Thomson</name>
</author>
<author>
<name sortKey="Ren, B" uniqKey="Ren B">B Ren</name>
</author>
<author>
<name sortKey="Ecker, Jr" uniqKey="Ecker J">JR Ecker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taylor, Kh" uniqKey="Taylor K">KH Taylor</name>
</author>
<author>
<name sortKey="Kramer, Rs" uniqKey="Kramer R">RS Kramer</name>
</author>
<author>
<name sortKey="Davis, Wj" uniqKey="Davis W">WJ Davis</name>
</author>
<author>
<name sortKey="Guo, J" uniqKey="Guo J">J Guo</name>
</author>
<author>
<name sortKey="Duff, Dj" uniqKey="Duff D">DJ Duff</name>
</author>
<author>
<name sortKey="Xu, D" uniqKey="Xu D">D Xu</name>
</author>
<author>
<name sortKey="Caldwell, Cw" uniqKey="Caldwell C">CW Caldwell</name>
</author>
<author>
<name sortKey="Shi, H" uniqKey="Shi H">H Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, S" uniqKey="Kim S">S Kim</name>
</author>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Paik, H" uniqKey="Paik H">H Paik</name>
</author>
<author>
<name sortKey="Nephew, K" uniqKey="Nephew K">K Nephew</name>
</author>
<author>
<name sortKey="Shi, H" uniqKey="Shi H">H Shi</name>
</author>
<author>
<name sortKey="Kramer, R" uniqKey="Kramer R">R Kramer</name>
</author>
<author>
<name sortKey="Xu, D" uniqKey="Xu D">D Xu</name>
</author>
<author>
<name sortKey="Huang, Th" uniqKey="Huang T">TH Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Previti, C" uniqKey="Previti C">C Previti</name>
</author>
<author>
<name sortKey="Harari, O" uniqKey="Harari O">O Harari</name>
</author>
<author>
<name sortKey="Zwir, I" uniqKey="Zwir I">I Zwir</name>
</author>
<author>
<name sortKey="Val, Cd" uniqKey="Val C">CD Val</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormen, Th" uniqKey="Cormen T">TH Cormen</name>
</author>
<author>
<name sortKey="Leiserson, Ce" uniqKey="Leiserson C">CE Leiserson</name>
</author>
<author>
<name sortKey="Rivest, Rl" uniqKey="Rivest R">RL Rivest</name>
</author>
<author>
<name sortKey="Stein, C" uniqKey="Stein C">C Stein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Rohde, C" uniqKey="Rohde C">C Rohde</name>
</author>
<author>
<name sortKey="Tierling, S" uniqKey="Tierling S">S Tierling</name>
</author>
<author>
<name sortKey="Jurkowski, Tp" uniqKey="Jurkowski T">TP Jurkowski</name>
</author>
<author>
<name sortKey="Bock, C" uniqKey="Bock C">C Bock</name>
</author>
<author>
<name sortKey="Santacruz, D" uniqKey="Santacruz D">D Santacruz</name>
</author>
<author>
<name sortKey="Ragozin, S" uniqKey="Ragozin S">S Ragozin</name>
</author>
<author>
<name sortKey="Reinhardt, R" uniqKey="Reinhardt R">R Reinhardt</name>
</author>
<author>
<name sortKey="Groth, M" uniqKey="Groth M">M Groth</name>
</author>
<author>
<name sortKey="Walter, J" uniqKey="Walter J">J Walter</name>
</author>
<author>
<name sortKey="Jeltsch, A" uniqKey="Jeltsch A">A Jeltsch</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22536899</article-id>
<article-id pub-id-type="pmc">3311103</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-S3-S15</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-S3-S15</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A novel k-mer mixture logistic regression for methylation susceptibility modeling of CpG dinucleotides in human gene promoters</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Yang</surname>
<given-names>Youngik</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>yyang@jcvi.org</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Nephew</surname>
<given-names>Kenneth</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>knephew@indiana.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A3">
<name>
<surname>Kim</surname>
<given-names>Sun</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>sunkim.bioinfo@snu.ac.kr</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
J Craig Venter Institute, San Diego, CA, USA</aff>
<aff id="I2">
<label>2</label>
Medical Sciences Program, Indiana University School of Medicine, Bloomington, IN, USA</aff>
<aff id="I3">
<label>3</label>
School of Computer Science and Engineering, Bioinformatics Institute, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>21</day>
<month>3</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<issue>Suppl 3</issue>
<supplement>
<named-content content-type="supplement-title">ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011</named-content>
<named-content content-type="supplement-editor">Sun Kim and Wei Wang</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has been supported by NSF support number NSF IIS1137427: III: Small: Women in Bioinformatics Initiative at ACM-BCB 2011.</named-content>
</supplement>
<fpage>S15</fpage>
<lpage>S15</lpage>
<permissions>
<copyright-statement>Copyright ©2012 Yang et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Yang et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/S3/S15"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>DNA methylation is essential for normal development and differentiation and plays a crucial role in the development of nearly all types of cancer. Aberrant DNA methylation patterns, including genome-wide hypomethylation and region-specific hypermethylation, are frequently observed and contribute to the malignant phenotype. A number of studies have recently identified distinct features of genomic sequences that can be used for modeling specific DNA sequences that may be susceptible to aberrant CpG methylation in both cancer and normal cells. Although it is now possible, using next generation sequencing technologies, to assess human methylomes at base resolution, no reports currently exist on modeling cell type-specific DNA methylation susceptibility. Thus, we conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility at three different resolutions: CpG dinucleotides, CpG segments, and individual gene promoter regions.</p>
</sec>
<sec>
<title>Results</title>
<p>Using a k-mer mixture logistic regression model, we effectively modeled DNA methylation susceptibility across five different cell types. Further, at the segment level, we achieved an AUC of up to 0.75 in a 10-fold cross-validation study using a mixture of k-mers.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The significance of these results is threefold: 1) this is the first report to indicate that CpG methylation-susceptible "segments" exist; 2) our model demonstrates the significance of certain k-mers for the mixture model, potentially highlighting DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences across different tissue types; 3) as only 3 or 4 bp patterns had previously been used for modeling DNA methylation susceptibility, ours is the first demonstration that 6-mer modeling can be performed without loss of accuracy.</p>
</sec>
</abstract>
<conference>
<conf-date>1-3 August 2011</conf-date>
<conf-name>ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM-BCB)</conf-name>
<conf-loc>Chicago, IL, USA</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>DNA methylation is the chemical modification of DNA bases, mostly on cytosines that precede a guanosine in the DNA sequence, i.e., the CpG dinucleotides. This epigenetic modification involves the addition of a methyl group to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is essential for cellular growth, development and differentiation [
<xref ref-type="bibr" rid="B1">1</xref>
], playing a fundamental role in the activation of genes at the transcriptional level. In cancer cells, aberrant DNA methylation patterns, such as genome-wide hypomethylation and region-specific hypermethylation, are frequently observed [
<xref ref-type="bibr" rid="B2">2</xref>
]. CpG islands, short CpG-rich regions of DNA often located around gene promoters and normally protected from DNA methylation, become hypermethylated in cancer, contributing to transcriptional silencing [
<xref ref-type="bibr" rid="B3">3</xref>
,
<xref ref-type="bibr" rid="B4">4</xref>
]. CpG island methylation patterns have been shown to differ across cancer types, and recent studies have revealed that some CpG islands are "methylation sensitive", while others are "resistant" to DNA methylation [
<xref ref-type="bibr" rid="B5">5</xref>
]. Recent technological breakthroughs make it possible, for the first time, to measure human methylomes at base resolution [
<xref ref-type="bibr" rid="B6">6</xref>
], providing unprecedented opportunities for understanding the phenomenon of methylation susceptibility.</p>
<sec>
<title>Previous work</title>
<p>Several recent studies have attempted to predict CpG island methylation patterns in normal and cancer cells. DNA pattern recognition and supervised learning techniques were used by Feltus et al to discriminate methylation-prone (MP) and methylation-resistant (MR) CpG islands based on seven DNA sequence patterns [
<xref ref-type="bibr" rid="B7">7</xref>
]. McCabe et al then developed a classifier (PatMAn) based on the frequencies of those seven patterns in cancer [
<xref ref-type="bibr" rid="B8">8</xref>
], followed by "SUPER-PatMAn" for predicting methylation-susceptible CpG islands using both local sequence context and trans-acting factors such as SUZ12 [
<xref ref-type="bibr" rid="B9">9</xref>
]. In addition, Feltus et al used motifs related to 28 MP and MR CpG islands to predict DNA methylation susceptibility [
<xref ref-type="bibr" rid="B10">10</xref>
], and Keshet et al showed evidence of instructive mechanisms in cancer cells, finding common sequence motifs in the regions of promoters whose genes show tumor-specific "methylation susceptibility" [
<xref ref-type="bibr" rid="B11">11</xref>
]. A prediction method for finding a minority class in an imbalanced data setting (which is the case for DNA methylation data), called "cluster_boost", was recently developed by Goh et al and used to identify novel hypermethylated genes in cancer [
<xref ref-type="bibr" rid="B12">12</xref>
]. Fang et al developed "MethCGI" to predict the methylation status of CpG islands using a support vector machine and both local sequence context and transcription factor binding sites [
<xref ref-type="bibr" rid="B13">13</xref>
]. Finally, a prediction method using DNA sequence features of various types, including sequence, repeats, predicted structure, CpG islands, and genes, was developed by Bock et al to predict binding sites, conservation, and single nucleotide polymorphisms [
<xref ref-type="bibr" rid="B14">14</xref>
].</p>
<p>While the above studies focused on CpG island methylation susceptibility, recent experiments have convincingly demonstrated that methylation levels of CpG sites, i.e., the genomic locations of CpG dinucleotides, within a CpG island can be highly variable. For example, Handa et al found that certain sequence features flanking CpG sites were associated with high- and low-methylation CpG sites in an in vitro DNMT1 overexpression model [
<xref ref-type="bibr" rid="B15">15</xref>
]. Moreover, at single base pair resolution, Zhang et al demonstrated that DNA methylation levels frequently differ within a CpG island [
<xref ref-type="bibr" rid="B16">16</xref>
]. To investigate the role of DNA methylation during development in human embryonic stem cells, Brunner et al developed Methyl-seq, which assays DNA methylation at more than 90,000 regions throughout the genome [
<xref ref-type="bibr" rid="B17">17</xref>
]. Using bisulfite sequencing data, Lister et al determined the first genome-wide, single-base-resolution maps of methylated cytosines in mammalian genomes (human embryonic stem cells (ESC) and fetal fibroblasts) [
<xref ref-type="bibr" rid="B18">18</xref>
]. By using "ultradeep" sequencing data from Taylor et al [
<xref ref-type="bibr" rid="B19">19</xref>
], we demonstrated that CpG flanking sequences can be used to model methylation susceptible CpG sites [
<xref ref-type="bibr" rid="B20">20</xref>
]. Finally, Previti et al analyzed tissue-specific CpG island methylation status in terms of profiles created by probabilistically combining two sources of independent clusters (clusters from methylation data in 12 tissues and clusters from CGI attributes), and demonstrated the predictive power of their method with a decision tree classifier [
<xref ref-type="bibr" rid="B21">21</xref>
]. Those investigators categorized profiles into four classes:
<italic>constitutive unmethylated, constitutive methylated, unmethylated in sperm, and differentially methylated </italic>
[
<xref ref-type="bibr" rid="B21">21</xref>
].</p>
</sec>
<sec>
<title>Motivation</title>
<p>Previous CpG island methylation susceptibility prediction studies have not considered cell type-specific methylation status. Considering variations in DNA methylation level even in the same genomic regions of different types of cells, we asked the question: can cell type-specific DNA methylation susceptibility be modeled? The significance of exploring this question is based on evidence supporting the strong association of genomic sequence features with DNA methylation status. Furthermore, recent studies strongly indicate the existence of methylation sensitive/resistant CpG islands in different cancer types [
<xref ref-type="bibr" rid="B5">5</xref>
]. In this paper, we performed a comprehensive DNA methylation susceptibility modeling study in five different cell lines at three different levels: CpG sites, entire promoter regions, and short DNA segments. We focused on DNA methylation in the context of CpG dinucleotides in adult cells (we are aware of a recent study [
<xref ref-type="bibr" rid="B18">18</xref>
] reporting non-CpG methylation in ESC).</p>
</sec>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>The problem: methylation-susceptible DNA segment modeling</title>
<sec>
<title>The need for segment modeling</title>
<p>Bisulfite sequencing data clearly demonstrate that methylation levels, even within a single gene promoter, can be highly variable. Furthermore, a figure in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
shows highly variable methylation of the same promoter sequence in five different cell lines, i.e. cell type-specific DNA methylation susceptibility (bisulfite sequencing data obtained from [
<xref ref-type="bibr" rid="B16">16</xref>
]).</p>
</sec>
<sec>
<title>Definition of the problem</title>
<p>The following notation was used to formally define the problem. Let
<bold>x </bold>
= {
<italic>x
<sub>i</sub>
</italic>
} denote a small set of pre-selected k-mers, where a k-mer is a sequence of a fixed number (k) of DNA bases. Labels
<bold>t </bold>
= {
<italic>t
<sub>j</sub>
</italic>
} on data are assigned as +/- depending on methylation level
<italic>p
<sub>j </sub>
</italic>
of each sample.</p>
<p>For each cell type, a k-mer mixture logistic regression model (equation 1) was built using a small set of pre-selected patterns, i.e.
<italic>k</italic>
-mers. To select the best logistic model, predicted methylation at a CpG site (based on the logistic model under consideration) was compared with actual CpG methylation obtained from the bisulfite sequencing data. To make the comparison, we calibrated the predicted methylation level between 0 and 1 (below).</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-13-S3-S15-i1" overflow="scroll">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M2" name="1471-2105-13-S3-S15-i2" overflow="scroll">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<italic>β
<sub>i</sub>
</italic>
's are parameters to be learned for the machine learning predictor.</p>
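<p>As an illustration only, equation 1 can be sketched in a few lines of Python. The coefficient values and k-mer counts below are hypothetical and are not taken from the study; the function names are likewise assumptions made for this sketch.</p>
<preformat><![CDATA[
```python
import math


def linear_score(beta, x):
    """f(x) = sum_i beta_i * x_i over the selected k-mer features."""
    return sum(b * xi for b, xi in zip(beta, x))


def predict_methylation(beta, x):
    """Equation 1: y = 1 / (1 + exp(-f(x))), a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_score(beta, x)))


# Hypothetical coefficients and k-mer occurrence counts (illustration only).
beta = [0.8, -1.2, 0.3]
x = [2, 1, 0]
y = predict_methylation(beta, x)
```
]]></preformat>
<p>A score of f(x) = 0 maps to y = 0.5, so the sign of f(x) decides on which side of the midpoint a prediction falls.</p>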
<sec>
<title>The k-mer mixture modeling problem</title>
<p>Our goal was to test whether methylation susceptibility can be modeled by a logistic regression model using a small set of k-mers. Although using k-mers for DNA methylation modeling is not entirely new, to our knowledge, only short k-mers (3 or 4 bp in length) were used in previous studies [
<xref ref-type="bibr" rid="B14">14</xref>
]. As short k-mers can occur in almost every DNA sequence, modeling using 3 or 4 bp relies on k-mer frequency.</p>
<p>1. First, we attempted to use longer k-mers (up to 6 bp) to utilize those that only occur in methylation susceptible sequences (vs. frequency for short k-mers, described above).</p>
<p>2. Our goal of determining whether machine learning predictors can be built by using k-mers required that we address two important issues: over-fitting and generalizability of prediction beyond the training data. The over-fitting problem was addressed by selecting a small number of k-mers from the training data set (using a larger number of k-mers can easily over-fit the training data). Cross validation was used to test the generalizability of prediction power. We selected k-mers and built predictors by using only the training data set. We then assessed the predictor on the test data set, which was not used for either selecting k-mer features or building predictors.</p>
</sec>
</sec>
<sec>
<title>Two k-mer feature selection methods</title>
<p>We used a selected set of k-mers for DNA methylation susceptibility modeling in the different cell types. The research question explored in this paper is the feasibility of modeling methylation susceptible segments given a set of k-mers. As selection of the "best set" of k-mers for modeling was not explored (a solution to the combined problem was too difficult), we used two standard pattern selection methods for a two-class data set.</p>
<p>1. Feature selection with t-test: A popular t-test method was used to select k-mers because of its simplicity and applicability for all modeling approaches. For each attribute
<italic>a</italic>
, occurrences of
<italic>a </italic>
were counted in positive samples and negative samples. Then, the P-value of
<italic>a </italic>
was measured by t-test. A fixed number of patterns was selected from a list of k-mers ordered by P-value. Alternatively, patterns with a P-value below a threshold were selected.</p>
<p>2. Feature selection with the random forest technique: The RF algorithm [
<xref ref-type="bibr" rid="B22">22</xref>
] can be used for feature selection. The usefulness of the RF-based feature selection method was clearly demonstrated by Yi-Wei Chen and Chih-Jen Lin at the NIPS 2003 feature selection challenge [
<xref ref-type="bibr" rid="B23">23</xref>
]. We used an extended version of the RF-based feature selection method. Multiple rounds of the RF-based feature selection were performed using a balanced data set of methylation-susceptible and non-susceptible sequences. We performed
<italic>k </italic>
RF runs, where each RF run used
<italic>n </italic>
random trees; only the top
<italic>N </italic>
attributes with z-scores > 0 were collected. After
<italic>k </italic>
RF runs, the attributes that had appeared in at least
<italic>p</italic>
% of the runs were selected. The values were set to
<italic>k </italic>
= 30,
<italic>n </italic>
= 100,
<italic>N </italic>
= 100, and
<italic>p </italic>
= 90 for the k-mer feature selection.</p>
<p>In both methods, we extracted a set of patterns in the balanced data set. First, centered at each CpG site, we extracted a flanking sequence of length
<italic>l</italic>
, where we set
<italic>l </italic>
= 100. A label of the CpG site was given as +/- depending on methylation level. Then, we balanced the data so that the + and - classes contained equal numbers of samples. The set of all k-mers obtained from sliding windows over each sequence was used for k-mer feature selection.</p>
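<p>The sliding-window extraction and the t-test ranking described above might be sketched as follows. This is a simplified sketch under stated assumptions: it ranks k-mers by the absolute value of Welch's t statistic instead of a P-value (computing the P-value needs the t-distribution, which the Python standard library does not provide), and the function names are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter
from math import copysign, inf, sqrt


def kmer_counts(seq, k):
    """Count every k-mer seen by a sliding window of width k over seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def welch_t(a, b):
    """Welch's t statistic for two samples with at least two values each."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = sqrt(var_a / len(a) + var_b / len(b))
    if denom == 0:
        # No within-class variation: equal means give 0, otherwise a perfect split.
        return 0.0 if mean_a == mean_b else copysign(inf, mean_a - mean_b)
    return (mean_a - mean_b) / denom


def rank_kmers(pos_seqs, neg_seqs, k, top):
    """Return the `top` k-mers with the largest |t| between + and - samples."""
    pos = [kmer_counts(s, k) for s in pos_seqs]
    neg = [kmer_counts(s, k) for s in neg_seqs]
    vocab = sorted(set().union(*pos, *neg))
    scored = [(abs(welch_t([c[m] for c in pos], [c[m] for c in neg])), m)
              for m in vocab]
    # Largest |t| first; ties are broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [m for _, m in scored[:top]]
```
]]></preformat>
<p>In the setting described in the text, the inputs would be the balanced sets of 100 bp flanking sequences, and the returned list would play the role of the selected k-mer features.</p>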
</sec>
<sec>
<title>Modeling methylation levels of DNA segments</title>
<sec>
<title>Definition</title>
<p>A boundary variable
<italic>B
<sub>i </sub>
</italic>
at a genomic sequence position is an indicator variable that is defined where two adjacent CpG sites have different labels. The value 1 of
<italic>B
<sub>i </sub>
</italic>
denotes that the genomic position is a boundary and the value 0 denotes that the position is not a boundary. A DNA segment
<italic>S </italic>
is defined by two boundary variables
<italic>B
<sub>a </sub>
</italic>
and
<italic>B
<sub>z </sub>
</italic>
where
<italic>B
<sub>a </sub>
</italic>
= 1 and
<italic>B
<sub>z </sub>
</italic>
= 1 and for all
<italic>a </italic>
&lt;
<italic>i </italic>
&lt;
<italic>z</italic>
,
<italic>B
<sub>i </sub>
</italic>
= 0. Figure
<xref ref-type="fig" rid="F1">1</xref>
illustrates how boundary variables are used to define 10 segments. We call a set of DNA segments defined by the boundary variables a
<bold>
<italic>configuration</italic>
</bold>
.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Illustration of the initial segment definition.</bold>
Because all boundary variables are set to 1, 10 initial segments are defined. Later, the segment modeling algorithm considers alternative segment definitions by changing the boundary variable values. Figure was modified from [
<xref ref-type="bibr" rid="B16">16</xref>
,
<xref ref-type="bibr" rid="B25">25</xref>
].</p>
</caption>
<graphic xlink:href="1471-2105-13-S3-S15-1"></graphic>
</fig>
</sec>
<sec>
<title>Labeling data</title>
<p>Given a segment
<italic>S
<sub>i</sub>
</italic>
, the methylation probability
<italic>p
<sub>i </sub>
</italic>
of a segment was defined as a ratio of the number of CpG sites with the + label to the number of CpG sites in the segment. Then, the label
<italic>t
<sub>i </sub>
</italic>
of
<italic>S
<sub>i </sub>
</italic>
was assigned + if
<italic>p
<sub>i </sub>
</italic>
is greater than 0.5. Otherwise, a label - was assigned to
<italic>t
<sub>i</sub>
</italic>
.</p>
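<p>A minimal sketch of the boundary and labeling definitions above is given here. It assumes that CpG site labels are given as a list of "+" and "-" characters and that the two ends of the sequence act as implicit boundaries; the function names are hypothetical.</p>
<preformat><![CDATA[
```python
def initial_boundaries(labels):
    """B_i = 1 between adjacent CpG sites whose labels differ, else 0."""
    return [1 if a != b else 0 for a, b in zip(labels, labels[1:])]


def segments_from_boundaries(labels, boundaries):
    """Split the site labels into segments at every boundary set to 1."""
    segments, current = [], [labels[0]]
    for label, b in zip(labels[1:], boundaries):
        if b == 1:
            segments.append(current)
            current = []
        current.append(label)
    segments.append(current)
    return segments


def segment_label(segment):
    """p = fraction of '+' sites; the segment is '+' only if p > 0.5."""
    p = segment.count("+") / len(segment)
    return "+" if p > 0.5 else "-"


# Five CpG sites where sites 1, 2 and 4 carry the + label.
labels = list("++-+-")
finest = segments_from_boundaries(labels, initial_boundaries(labels))
```
]]></preformat>
<p>Setting a boundary value to 0 merges the two segments on either side of it, which is the elementary move used later by the segment merging algorithm.</p>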
</sec>
<sec>
<title>Attributes for modeling</title>
<p>K-mer occurrences in segments in the training data set were used as attributes. A small subset of k-mer features
<bold>x </bold>
was selected from all k-mers using the feature selection methods.</p>
</sec>
<sec>
<title>Modeling</title>
<p>A single logistic regression model was used to model all DNA segments for each cell line, using attributes
<bold>x </bold>
and labels
<bold>t</bold>
.</p>
</sec>
</sec>
<sec>
<title>Segment-level modeling challenges: exponential search space</title>
<p>Although the methylation status of a DNA segment can be defined by aggregating the methylation status of its individual CpG sites (as we did for the whole promoter region-modeling approach), how to define methylation susceptible DNA segments is currently unknown. For example, consider five CpG sites {
<italic>s</italic>
1,
<italic>s</italic>
2,
<italic>s</italic>
3,
<italic>s</italic>
4,
<italic>s</italic>
5} in a short DNA segment and assume that three sites,
<italic>s</italic>
1,
<italic>s</italic>
2,
<italic>s</italic>
4 are methylation susceptible and the other two sites
<italic>s</italic>
3,
<italic>s</italic>
5 are methylation resistant. By definition, the DNA segment is methylation susceptible, as the majority of sites (three) are methylation susceptible. However, if we divide the segment into two sub-segments, {
<italic>s</italic>
1,
<italic>s</italic>
2} and {
<italic>s</italic>
3,
<italic>s</italic>
4,
<italic>s</italic>
5}, one segment is susceptible to methylation and the other is resistant. To determine which of the two segment definitions can be better modeled for methylation susceptibility, all possible segment configurations would have to be enumerated and, for each configuration, a "best fit" logistic model computed for the methylation data in a cell line. The complexity of this problem can be discussed in terms of the well-known "counting the number of parenthesizations" problem [
<xref ref-type="bibr" rid="B24">24</xref>
], because a pair of parentheses can define a segment of CpG sites. The number of parenthesizations
<italic>P </italic>
(
<italic>n</italic>
) for
<italic>n </italic>
CpG sites is
<italic>P </italic>
(1) = 1;
<inline-formula>
<mml:math id="M3" name="1471-2105-13-S3-S15-i3" overflow="scroll">
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo class="MathClass-op">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
for
<italic>n </italic>
≥ 2. Because this count grows exponentially, finding an optimal solution by exhaustive search is impractical (the count is known to be Ω (2
<italic>
<sup>n</sup>
</italic>
) [
<xref ref-type="bibr" rid="B24">24</xref>
]). Thus, we developed a heuristic algorithm that uses random segment merging, starting from the finest definition of segments.</p>
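<p>To make the growth concrete, the recurrence for P(n) stated above can be evaluated directly. The short sketch below uses memoization; the function name is an assumption made for illustration.</p>
<preformat><![CDATA[
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def parenthesizations(n):
    """P(1) = 1; P(n) = sum over i = 1..n-1 of P(i) * P(n - i) for n >= 2."""
    if n == 1:
        return 1
    return sum(parenthesizations(i) * parenthesizations(n - i)
               for i in range(1, n))


# First few values of the recurrence.
first_values = [parenthesizations(n) for n in range(1, 7)]
```
]]></preformat>
<p>The first six values are 1, 1, 2, 5, 14 and 42, and the count already reaches 4862 for ten CpG sites.</p>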
</sec>
</sec>
<sec>
<title>A random binary segment merging algorithm</title>
<p>A naïve approach to segment modeling simply enumerates all possible segment configurations. Every combination of segment boundaries is considered by varying the value of each boundary indicator variable
<italic>B
<sub>i </sub>
</italic>
∈ {0, 1}. Then, an error function for each segment configuration is computed. However, this requires the enumeration of 2
<italic>
<sup>m </sup>
</italic>
possible segment configurations, where
<italic>m </italic>
is the number of
<italic>B
<sub>i</sub>
</italic>
. To compute the optimal k-mer logistic regression model, segment boundaries must first be identified; however, as these are unknown, we started with an initial presumption of the methylation susceptible and resistant segments. We then used an iterative improvement procedure in search of both the segment definition and the best fitting logistic regression model. The major steps of the segment modeling algorithm are as follows:</p>
<p>1.
<bold>Initialization of a configuration: </bold>
Define a boundary variable
<italic>B
<sub>i </sub>
</italic>
= 1 at every genomic position where labels (+ or -) of two adjacent CpG sites around the position are different. Define a segment as a DNA region between two boundary variables set to 1. By taking this approach, we start with a configuration of smallest possible segments. By merging segments in many different ways and re-calculating the logistic regression model, the algorithm attempts to find the best segment configuration. This is how I
<sc>NITIAL</sc>
C
<sc>ONFIGURATION</sc>
() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.</p>
<p>2.
<bold>Computing a logistic regression model</bold>
: Given the k-mer occurrences and a segment configuration, compute a logistic regression model using equation (1). This is how
<sc>COMPUTE</sc>
M
<sc>ODEL</sc>
() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.</p>
<p>3.
<bold>Computing an error of a segment configuration</bold>
: Errors in the segment set
<inline-formula>
<mml:math id="M4" name="1471-2105-13-S3-S15-i4" overflow="scroll">
<mml:mi mathvariant="script">S</mml:mi>
</mml:math>
</inline-formula>
are measured by (2).</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M5" name="1471-2105-13-S3-S15-i5" overflow="scroll">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo mathsize="big">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo class="MathClass-rel">|</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>ŷ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M6" name="1471-2105-13-S3-S15-i6" overflow="scroll">
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo class="MathClass-rel">|</mml:mo>
</mml:math>
</inline-formula>
is the total number of segments,
<inline-formula>
<mml:math id="M7" name="1471-2105-13-S3-S15-i7" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>ŷ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
is the predicted methylation level of the segment
<italic>i</italic>
,
<italic>t
<sub>i </sub>
</italic>
is the actual methylation level of the segment
<italic>i</italic>
, and
<italic>w
<sub>i </sub>
</italic>
is the weight of each segment. A segment weight is defined as
<inline-formula>
<mml:math id="M8" name="1471-2105-13-S3-S15-i8" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mover accent="false" class="mml-overline">
<mml:mrow>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
<mml:mo class="MathClass-bin">/</mml:mo>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>
, where
<inline-formula>
<mml:math id="M9" name="1471-2105-13-S3-S15-i9" overflow="scroll">
<mml:mover accent="false" class="mml-overline">
<mml:mrow>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
is the average count of CpG sites in all segments and |
<italic>S
<sub>i</sub>
| </italic>
is the count of CpG sites in segment <italic>i</italic>. The weight
<italic>w
<sub>i </sub>
</italic>
of each segment is inversely proportional to its size relative to the average segment size. In this way, large segments are penalized less, and vice versa. This is how
<sc>COMPUTE</sc>
E
<sc>RROR</sc>
() is implemented in the HillClimbingConfigurationSearch in Algorithm 1.</p>
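<p>Equation 2 and the segment weights can be sketched as follows. The sketch assumes that each segment is summarized by its number of CpG sites, its predicted methylation level, and its actual methylation level; the function names are hypothetical.</p>
<preformat><![CDATA[
```python
def segment_weights(sizes):
    """w_i = (average CpG count over all segments) / (CpG count of segment i)."""
    mean_size = sum(sizes) / len(sizes)
    return [mean_size / size for size in sizes]


def weighted_squared_error(predicted, actual, sizes):
    """Equation 2: sum over segments of w_i * (predicted_i - actual_i) ** 2."""
    weights = segment_weights(sizes)
    return sum(w * (y - t) ** 2
               for w, y, t in zip(weights, predicted, actual))


# Two hypothetical segments with 1 and 3 CpG sites.
error = weighted_squared_error([0.5, 0.5], [1.0, 0.0], [1, 3])
```
]]></preformat>
<p>With these hypothetical values the average size is 2, so the one-site segment receives weight 2 and the three-site segment weight 2/3, which shows how larger segments are penalized less.</p>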
<sec>
<title>The random binary segment merging algorithm</title>
<p>Given the current segment configuration {
<italic>B
<sub>i</sub>
</italic>
}, a segment is randomly chosen using a distribution of errors measured by a weighted square error. For a segment
<italic>B
<sub>j</sub>
</italic>
, the weighted square error is defined by
<inline-formula>
<mml:math id="M10" name="1471-2105-13-S3-S15-i10" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>ŷ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
where the weight of the segment
<inline-formula>
<mml:math id="M11" name="1471-2105-13-S3-S15-i11" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-bin">/</mml:mo>
<mml:mover accent="false" class="mml-overline">
<mml:mrow>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
,
<inline-formula>
<mml:math id="M12" name="1471-2105-13-S3-S15-i12" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>ŷ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
is the predicted methylation level of the segment
<italic>j</italic>
, and
<italic>t
<sub>j </sub>
</italic>
is the actual methylation level of the segment
<italic>j</italic>
. A segment is chosen by random sampling using the segment error vector ⟨
<italic>e</italic>
<sub>1</sub>
, . . . ,
<italic>e</italic>
<sub>n </sub>
⟩, where
<italic>n </italic>
is the number of segments in the current segment configuration. Sampling from this error vector favors segments with a higher prediction error while keeping the choice random. Note that segments already considered for merging are excluded from the next round of sampling (see the use of
<monospace>visit[]</monospace>
in the HillClimbingConfigurationSearch in Algorithm 1).</p>
<p>Once a segment
<italic>B
<sub>j </sub>
</italic>
is chosen, it is tentatively merged with segment
<italic>B
<sub>j+1 </sub>
</italic>
next to
<italic>B
<sub>j</sub>
</italic>
. Then the logistic regression model is re-calculated. The merge is accepted only if it reduces the weighted squared error (equation 2). Otherwise, the merge is rejected and the original segment configuration is retained. A segment
<italic>B
<sub>j </sub>
</italic>
considered for merging is marked so that it will not be chosen again in the next step. This sampling and marking of segments is repeated until all segments in the current configuration have been considered for merging.</p>
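<p>One possible reading of the error-guided sampling step is sketched below. The text does not specify the exact sampling distribution, so this sketch assumes that an unvisited segment is drawn with probability proportional to its weighted squared error; the function name mirrors the selectAtRandom call in Algorithm 1, but its signature is an assumption.</p>
<preformat><![CDATA[
```python
import random


def select_at_random(errors, visited, rng=random):
    """Draw an unvisited segment index with probability proportional to its error.

    Returns None when every segment has already been considered.
    """
    candidates = [j for j, seen in enumerate(visited) if not seen]
    if not candidates:
        return None
    weights = [errors[j] for j in candidates]
    if sum(weights) <= 0:
        # All remaining errors are zero: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical per-segment errors; segments 0 and 2 were already considered.
rng = random.Random(0)
chosen = select_at_random([0.9, 0.1, 0.5], [True, False, True], rng)
```
]]></preformat>
<p>Passing a seeded random.Random instance, as in the usage line, keeps a run reproducible.</p>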
<p>
<bold>Input </bold>
: A set of pre-selected k-mers K = {
<italic>x
<sub>i</sub>
</italic>
}; Occurrences of K; Methylation levels at CpG sites</p>
<p>
<bold>Output</bold>
: A logistic regression model; A segment configuration.</p>
<p>
<bold>HillClimbingConfigurationSearch</bold>
(N)</p>
<p>begin</p>
<p>   (
<italic>C</italic>
*,
<italic>M</italic>
*,
<italic>E</italic>
*) =
<monospace>RandomConfigurationSearch ()</monospace>
</p>
<p>   
<bold>for </bold>
<italic>i </italic>
← 2
<bold>to </bold>
<italic>N </italic>
<bold>do</bold>
</p>
<p>      (
<italic>C</italic>
,
<italic>M</italic>
,
<italic>E</italic>
) =
<monospace>RandomConfigurationSearch ()</monospace>
</p>
<p>      
<bold>if </bold>
<italic>E </italic>
&lt;
<italic>E</italic>
*
<bold>then</bold>
</p>
<p>         
<italic>C</italic>
* =
<italic>C</italic>
;
<italic>M</italic>
* =
<italic>M</italic>
;
<italic>E</italic>
* =
<italic>E</italic>
</p>
<p>      
<bold>end</bold>
</p>
<p>   
<bold>end</bold>
</p>
<p>   
<bold>report </bold>
(
<italic>C</italic>
*,
<italic>M</italic>
*,
<italic>E</italic>
*)</p>
<p>end</p>
<p>
<bold>RandomConfigurationSearch </bold>
( )</p>
<p>begin</p>
<p>   
<italic>C </italic>
=
<monospace>InitialConfiguration ()</monospace>
;
<italic>E </italic>
= 1.0         
<monospace>//Reset configuration; See text.</monospace>
</p>
<p>   
<bold>while </bold>
<italic>true </italic>
<bold>do</bold>
</p>
<p>      (C',M',E') =
<monospace>RandomBinaryMerging(</monospace>
<italic>C</italic>
<monospace>)</monospace>
</p>
<p>      
<bold>if </bold>
(
<italic>E </italic>
-
<italic>E</italic>
') ≤
<italic>δ </italic>
<bold>then break</bold>
</p>
<p>      
<italic>C </italic>
=
<italic>C'</italic>
;
<italic>M </italic>
=
<italic>M</italic>
';
<italic>E </italic>
=
<italic>E'</italic>
</p>
<p>   
<bold>end</bold>
</p>
<p>   
<bold>return </bold>
<italic>(C,M,E)</italic>
</p>
<p>end</p>
<p>
<bold>RandomBinaryMerging</bold>
(
<bold>configuration </bold>
<italic>C</italic>
)</p>
<p>begin</p>
<p>   
<italic>M </italic>
=
<monospace>computeModel(</monospace>
<italic>C</italic>
,
<italic>K </italic>
<monospace>)</monospace>
         
<monospace>//Equation 1; Training stage only</monospace>
</p>
<p>   
<italic>E </italic>
=
<monospace>computeError(</monospace>
<italic>C</italic>
,
<italic>M </italic>
<monospace>)</monospace>
         
<monospace>//Equation 2</monospace>
</p>
<p>   
<bold>bool </bold>
<italic>visit</italic>
[
<italic>n</italic>
] =
<bold>{false</bold>
}         
<monospace>//Mark that no segments are considered.</monospace>
</p>
<p>   
<bold>while </bold>
∃ <italic>i such that visit</italic>
[
<italic>i</italic>
] ==
<bold>false do</bold>
</p>
<p>      
<italic>j </italic>
=
<monospace>selectAtRandom(</monospace>
<italic>visit</italic>
<monospace>)</monospace>
         
<monospace>//See text.</monospace>
</p>
<p>      
<italic>visit</italic>
[
<italic>j</italic>
] =
<bold>true </bold>
         
<monospace>//</monospace>
<italic>s
<sub>j </sub>
</italic>
<monospace>is merge candidate.</monospace>
</p>
<p>      
<italic>C</italic>
' =
<italic>C</italic>
</p>
<p>      
<inline-formula>
<mml:math id="M13" name="1471-2105-13-S3-S15-i13" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
=
<bold>false </bold>
         
<monospace>//Merge </monospace>
<italic>s
<sub>j </sub>
</italic>
<monospace>and </monospace>
<italic>s
<sub>j+1</sub>
</italic>
.</p>
<p>      
<italic>M</italic>
' =
<monospace>computeModel(</monospace>
<italic>C</italic>
',
<italic>K </italic>
<monospace>)</monospace>
         
<monospace>//Equation 1; Training stage only</monospace>
</p>
<p>      
<italic>E</italic>
' =
<monospace>computeError(</monospace>
<italic>C</italic>
',
<italic>M</italic>
'
<monospace>)</monospace>
         
<monospace>//Equation 2</monospace>
</p>
<p>      
<bold>if </bold>
<italic>E </italic>
>
<italic>E</italic>
'
<bold>then</bold>
</p>
<p>         
<italic>C </italic>
=
<italic>C</italic>
';
<italic>M </italic>
=
<italic>M</italic>
';
<italic>E </italic>
=
<italic>E</italic>
';
<italic>visit</italic>
[
<italic>j </italic>
+ 1] =
<bold>true </bold>
         
<monospace>//Accept </monospace>
<italic>C</italic>
'.</p>
<p>      
<bold>else</bold>
</p>
<p>         
<inline-formula>
<mml:math id="M14" name="1471-2105-13-S3-S15-i14" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
=
<bold>true </bold>
         
<monospace>//Reject </monospace>
<italic>C</italic>
'.</p>
<p>      
<bold>end</bold>
</p>
<p>   
<bold>end</bold>
</p>
<p>   
<bold>return </bold>
<italic>(C,M,E)</italic>
</p>
<p>end</p>
<p>
<bold>Algorithm 1: </bold>
Hill climbing configuration search algorithm. The algorithm tries to merge two adjacent segments chosen at random until all segments have been considered for merging. A new configuration is accepted only when the error is reduced with a new logistic regression model; thus, it is a hill climbing algorithm.</p>
</sec>
</sec>
</sec>
<sec>
<title>Results</title>
<sec>
<title>Data set</title>
<p>We used data from Zhang et al [
<xref ref-type="bibr" rid="B16">16</xref>
] for DNA methylation patterns in chromosome 21 (297 amplicons from 190 gene promoters using bisulfite conversion, subcloning and sequencing DNA as the major experimental methods). The bisulfite sequencing data were collected in five cell types: human peripheral blood (primarily leukocytes), fibroblasts, the human embryonic kidney cell line HEK293, the human hepatocellular liver carcinoma cell line HepG2, and fibroblast cells derived from a patient with Down syndrome (trisomy 21). Methylation patterns differed widely and were specific to each cell type.</p>
</sec>
<sec>
<title>Experimental setup</title>
<p>The 10-fold cross validation (described above) was used to compare the performances of three modeling approaches. For each round of 10-fold validation, one of the 10 subsets was set aside for testing, and the k-mer features were selected only from the training set, ensuring that the test data would have no influence on the k-mer feature selection. Also, regression coefficients were computed only in the training stage. We measured the area under the ROC curve (AUC) score for performance comparison.</p>
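<p>The evaluation protocol can be sketched with the standard library alone. The sketch below computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as one half, and produces 10 train/test index splits; how samples were actually assigned to folds in the study is not stated, so the shuffled round-robin split and the function names are assumptions.</p>
<preformat><![CDATA[
```python
import random


def auc_score(labels, scores):
    """AUC: chance that a random + sample scores above a random - sample."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]
```
]]></preformat>
<p>In a full pipeline, k-mer selection and model fitting would be run on each train list only, and auc_score would be evaluated on the matching test list.</p>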
</sec>
<sec>
<title>Effectiveness of the segment modeling approach</title>
<p>We extensively tested the effectiveness of the segment modeling algorithm using 4-mer, 5-mer, and 6-mer patterns. For each of the experiments, the AUC score was measured from 10-fold cross validation for the initial segment definition vs. the final segment definition. The RF-based algorithm with 100 trees was used for k-mer feature selection. For each k-mer selection procedure, 30 random experiments were performed, and k-mers with z-score > 0 that appeared in at least 90% of experiments were selected as k-mer features. Using the set of k-mers, the optimal logistic regression model was computed.</p>
</sec>
<sec>
<title>10-fold cross validation experiments</title>
<p>The performance comparison between the initial segments and the final segments in the test set is shown in Figure
<xref ref-type="fig" rid="F2">2</xref>
. Bars between adjacent dotted lines show the improvement in AUC score from the model built with the initial segment setting to the model built with the final segment setting. We measured the performance improvement using 4-mer, 5-mer, and 6-mer features. For each cell type, the segment modeling algorithm identified significantly improved segment definitions. The five panels in each plot correspond to tissue types: (A) Fibroblast, (B) HEK293, (C) HepG2, (D) Leukocytes, and (E) Trisomy 21. Our algorithm achieved approximately 10% improvement in most cell types, illustrating the effectiveness of the segment modeling algorithm.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Effectiveness of segment modeling in 10-fold cross-validation experiments.</bold>
Bars between adjacent dotted lines show the improvement in AUC score between the two models, one fitted with the initial segment setting and one with the final segment setting. We measured the performance improvement using 4-mer, 5-mer, and 6-mer features. For each cell type, the segment modeling algorithm identified significantly improved segment definitions. The five panels in each plot correspond to tissue types: (A) Fibroblast, (B) HEK293, (C) HepG2, (D) Leukocytes, and (E) Trisomy 21.</p>
</caption>
<graphic xlink:href="1471-2105-13-S3-S15-2"></graphic>
</fig>
</sec>
<sec>
<title>Search behavior</title>
<p>The search behavior of the segment modeling algorithm is shown in Figure
<xref ref-type="fig" rid="F3">3</xref>
. In this experiment, we used the whole data set to show the algorithmic convergence of our approach. The learning error (Equation 2) was reduced at each iteration of segment merging and model re-calculation. Our random segment sampling algorithm converged for all 15 cases of 5 different cell lines with 4-, 5-, and 6-mers.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>The search behavior of the segment modeling algorithm using the whole data set.</bold>
Pairwise plots showing the reduction in learning error (Equation 2) at each iteration of segment merging and model recalculation. Columns correspond to k-mer lengths; rows correspond to cell lines. In each plot, the X-axis denotes the number of iterations and the Y-axis denotes the weighted squared prediction error. The HillClimbing search algorithm effectively reduced the error between prediction and observation. In fitting the whole data set, as opposed to 10-fold cross validation, the final model predicted methylation susceptibility in the different cell types.</p>
</caption>
<graphic xlink:href="1471-2105-13-S3-S15-3"></graphic>
</fig>
</sec>
<sec>
<title>Discussion on the predictive power of the model</title>
<p>The predictive power of the model measured by 10-fold cross validation is encouraging. For 6-mers, the prediction accuracy (AUC) was 0.69 for Fibroblast, 0.70 for HEK293, 0.54 for HepG2, 0.73 for Leukocytes, and 0.65 for Trisomy 21. Such accuracies would not be expected from random data sets, where the expected prediction accuracy is 0.5. Variations in the prediction accuracy across the five cell types, especially the low value for HepG2, may be due to cell type-specific characteristics. On the other hand, the data obtained from [
<xref ref-type="bibr" rid="B16">16</xref>
] was of low coverage: amplicons covered less than 0.2% of the entire Chromosome 21. Thus, variations in the prediction accuracy may be due to the low coverage of the data used. We were not able to further verify why the prediction accuracy varied. In fitting the whole data set, as opposed to 10-fold cross validation, the final model was able to accurately predict methylation susceptibility.</p>
</sec>
<sec>
<title>Effect of the number of k-mers used for prediction</title>
<p>We conducted comprehensive modeling of cell type-specific DNA methylation susceptibility at three different resolutions: individual CpG sites, CpG segments, and promoter regions. The three modeling approaches were compared in terms of the AUC obtained by 10-fold cross validation. The methods for modeling at individual CpG sites and at promoter regions are described in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
. To measure the effect of the number of k-mer patterns used for modeling, 10-fold cross validations were performed with the number of k-mer patterns varied from 10 to 100 in steps of 10. P-values from t-tests were used to select the k-mers. The experimental results are illustrated in Figure
<xref ref-type="fig" rid="F4">4</xref>
. Only the segment modeling approach was effective for all 4-, 5-, and 6-mer experiments. Interestingly, the number of k-mers used for modeling had little impact on the prediction result, demonstrating that the prediction accuracy did not derive from over-fitting the data and indicating that a small number of selected k-mers can effectively model methylation susceptibility without a loss of prediction power. Moreover, when longer k-mers were used (up to 6-mers), the prediction accuracy did not decrease. This finding is highly encouraging because, on average, a given 6-mer is unlikely to occur by chance in a short (274 bp) DNA segment. Thus, a set of 6-mers can be used to model DNA methylation susceptibility.</p>
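<p>The remark about 6-mers in a 274 bp segment can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent, equally likely bases and counts one strand only; promoter CpG islands are not uniformly random sequence, so the numbers are indicative rather than exact.</p>

```python
def expected_kmer_count(segment_length, k):
    """Expected occurrences of one fixed k-mer on one strand of a random sequence.

    Assumes independent, equally likely bases (probability 1/4 each), which is
    a simplification for illustration.
    """
    windows = max(segment_length - k + 1, 0)  # number of positions where a k-mer can start
    return windows / 4 ** k

for k in (4, 5, 6):
    print(k, round(expected_kmer_count(274, k), 4))
```

<p>Under these assumptions, a given 4-mer is expected about 1.06 times in a 274 bp segment, a given 5-mer about 0.26 times, and a given 6-mer about 0.066 times.</p>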
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Effect of the number of k-mers used for three modeling approaches.</bold>
The performance of the three modeling approaches was measured by 10-fold cross validation. Each bar is the AUC value of one experiment. The X-axis is the number of most significant variables (by t-test p-value) used in each experiment. Consistently from 4-mers to 6-mers, and regardless of the number of patterns, segment modeling outperformed the other modeling approaches. More importantly, the experiments with the number of k-mers varied from 10 to 100 show that the number of selected k-mers does not have a large impact on model performance, and that the higher accuracies of the segment modeling approach, compared to the promoter and site-specific modeling approaches, are likely due to the effectiveness of the segment model.</p>
</caption>
<graphic xlink:href="1471-2105-13-S3-S15-4"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Conclusion</title>
<p>We conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility. By performing extensive computational experiments on data from five distinct cell types, we show that DNA methylation susceptibility can be accurately modeled at the segment level, achieving up to 0.75 in AUC prediction accuracy in a 10-fold cross validation study. The two-step iterative segment modeling algorithm successfully identified optimal segments that can be modeled with a logistic regression model using a set of k-mers. Our model further shows the significance of certain k-mers for the mixture model, which can potentially highlight DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences in different cells and tissues, including malignancies. As only patterns of up to 4 bp were used in previous modeling studies of DNA methylation susceptibility, this is the first report to show that k-mer modeling can be performed using up to 6-mers without loss of modeling accuracy.</p>
</sec>
<sec>
<title>List of abbreviations used</title>
<p>• AUC: area under the ROC curve; • DNA: deoxyribonucleic acid; • MP: methylation-prone; • MR: methylation-resistant; • RF: random forest; • YY: Youngik Yang; • SK: Sun Kim; • KN: Ken Nephew.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>YY designed the computational framework, conducted the simulations, and wrote the manuscript. KN gave critical input on the biological discussion of this work and drafted the manuscript. SK led the project, designed the algorithm and tests, and drafted the manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>DNA methylation level variation.</bold>
A figure in the file shows DNA methylation level variation in an amplicon from 5 cell types.</p>
</caption>
<media xlink:href="1471-2105-13-S3-S15-S1.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>
<bold>Competing modeling approaches.</bold>
Two competing modeling approaches, CpG site-specific modeling and promoter region modeling, are described for comparison with segment modeling.</p>
</caption>
<media xlink:href="1471-2105-13-S3-S15-S2.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements and funding</title>
<p>This work was supported by NIH U54 CA11300-02 (Interrogating Epigenetic Changes in Cancer Genomes) to SK and KN and by Korea National Research Foundation 0543-20110016 to SK.</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/13/S3">http://www.biomedcentral.com/1471-2105/13/S3</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Bird</surname>
<given-names>A</given-names>
</name>
<article-title>DNA methylation patterns and epigenetic memory</article-title>
<source>Genes Dev</source>
<year>2002</year>
<volume>16</volume>
<fpage>6</fpage>
<lpage>21</lpage>
<pub-id pub-id-type="doi">10.1101/gad.947102</pub-id>
<pub-id pub-id-type="pmid">11782440</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Jones</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Laird</surname>
<given-names>PW</given-names>
</name>
<article-title>Cancer-epigenetics comes of age</article-title>
<source>Nat Genet</source>
<year>1999</year>
<volume>21</volume>
<issue>2</issue>
<fpage>163</fpage>
<lpage>167</lpage>
<pub-id pub-id-type="doi">10.1038/5947</pub-id>
<pub-id pub-id-type="pmid">9988266</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Ting</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>McGarvey</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Baylin</surname>
<given-names>SB</given-names>
</name>
<article-title>The cancer epigenome-components and functional correlates</article-title>
<source>Genes Dev</source>
<year>2006</year>
<volume>20</volume>
<issue>23</issue>
<fpage>3215</fpage>
<lpage>3231</lpage>
<pub-id pub-id-type="doi">10.1101/gad.1464906</pub-id>
<pub-id pub-id-type="pmid">17158741</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Herman</surname>
<given-names>JG</given-names>
</name>
<name>
<surname>Baylin</surname>
<given-names>SB</given-names>
</name>
<article-title>Gene silencing in cancer in association with promoter hypermethylation</article-title>
<source>N Engl J Med</source>
<year>2003</year>
<volume>349</volume>
<issue>21</issue>
<fpage>2042</fpage>
<lpage>2054</lpage>
<pub-id pub-id-type="doi">10.1056/NEJMra023075</pub-id>
<pub-id pub-id-type="pmid">14627790</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Costello</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Frühwald</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Smiraglia</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Rush</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Robertson</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Wright</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Feramisco</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Peltomäki</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Schuller</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Bloomfield</surname>
<given-names>CD</given-names>
</name>
<name>
<surname>Caligiuri</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Yates</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nishikawa</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Su Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Petrelli</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>O'Dorisio</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Held</surname>
<given-names>WA</given-names>
</name>
<name>
<surname>Cavenee</surname>
<given-names>WK</given-names>
</name>
<name>
<surname>Plass</surname>
<given-names>C</given-names>
</name>
<article-title>Aberrant CpG-island methylation has non-random and tumour-type-specific patterns</article-title>
<source>Nat Genet</source>
<year>2000</year>
<volume>24</volume>
<issue>2</issue>
<fpage>132</fpage>
<lpage>138</lpage>
<pub-id pub-id-type="doi">10.1038/72785</pub-id>
<pub-id pub-id-type="pmid">10655057</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Laird</surname>
<given-names>PW</given-names>
</name>
<article-title>Principles and challenges of genome-wide DNA methylation analysis</article-title>
<source>Nat Rev Genet</source>
<year>2010</year>
<volume>11</volume>
<issue>3</issue>
<fpage>191</fpage>
<lpage>203</lpage>
<pub-id pub-id-type="pmid">20125086</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Feltus</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Costello</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Plass</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Vertino</surname>
<given-names>PM</given-names>
</name>
<article-title>Predicting aberrant CpG island methylation</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>2003</year>
<volume>100</volume>
<issue>21</issue>
<fpage>12253</fpage>
<lpage>12258</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.2037852100</pub-id>
<pub-id pub-id-type="pmid">14519846</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Prüfer</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Stenzel</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Dannemann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Lachmann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kelso</surname>
<given-names>J</given-names>
</name>
<article-title>PatMaN: rapid alignment of short sequences to large databases</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>13</issue>
<fpage>1530</fpage>
<lpage>1531</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn223</pub-id>
<pub-id pub-id-type="pmid">18467344</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>McCabe</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Vertino</surname>
<given-names>PM</given-names>
</name>
<article-title>A multifactorial signature of DNA sequence and polycomb binding predicts aberrant CpG island methylation</article-title>
<source>Cancer Res</source>
<year>2009</year>
<volume>69</volume>
<fpage>282</fpage>
<lpage>291</lpage>
<pub-id pub-id-type="doi">10.1158/0008-5472.CAN-08-3274</pub-id>
<pub-id pub-id-type="pmid">19118013</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Feltus</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>EK</given-names>
</name>
<name>
<surname>Costello</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Plass</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Vertino</surname>
<given-names>PM</given-names>
</name>
<article-title>DNA motifs associated with aberrant CpG island methylation</article-title>
<source>Genomics</source>
<year>2006</year>
<volume>87</volume>
<issue>5</issue>
<fpage>572</fpage>
<lpage>579</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2005.12.016</pub-id>
<pub-id pub-id-type="pmid">16487676</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Keshet</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Schlesinger</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Farkash</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rand</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hecht</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Pikarski</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Niveleau</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cedar</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>I</given-names>
</name>
<article-title>Evidence for an instructive mechanism of de novo methylation in cancer cells</article-title>
<source>Nat Genet</source>
<year>2006</year>
<volume>38</volume>
<issue>2</issue>
<fpage>149</fpage>
<lpage>153</lpage>
<pub-id pub-id-type="doi">10.1038/ng1719</pub-id>
<pub-id pub-id-type="pmid">16444255</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Goh</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Murphy</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Muhkerjee</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<article-title>Genomic sweeping for hypermethylated genes</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>3</issue>
<fpage>281</fpage>
<lpage>288</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl620</pub-id>
<pub-id pub-id-type="pmid">17148511</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Fang</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>MQ</given-names>
</name>
<article-title>Predicting methylation status of CpG islands in the human brain</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>18</issue>
<fpage>2204</fpage>
<lpage>2209</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl377</pub-id>
<pub-id pub-id-type="pmid">16837523</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Paulsen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tierling</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mikeska</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>J</given-names>
</name>
<article-title>CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure</article-title>
<source>PLoS Genet</source>
<year>2006</year>
<volume>2</volume>
<issue>3</issue>
<fpage>e26</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pgen.0020026</pub-id>
<pub-id pub-id-type="pmid">16520826</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Handa</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jeltsch</surname>
<given-names>A</given-names>
</name>
<article-title>Profound flanking sequence preference of Dnmt3a and Dnmt3b mammalian DNA methyltransferases shape the human epigenome</article-title>
<source>J Mol Biol</source>
<year>2005</year>
<volume>348</volume>
<issue>5</issue>
<fpage>1103</fpage>
<lpage>1112</lpage>
<pub-id pub-id-type="doi">10.1016/j.jmb.2005.02.044</pub-id>
<pub-id pub-id-type="pmid">15854647</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rohde</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Tierling</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jurkowski</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Santacruz</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ragozin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Reinhardt</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Groth</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jeltsch</surname>
<given-names>A</given-names>
</name>
<article-title>DNA methylation analysis of chromosome 21 gene promoters at single base pair and single allele resolution</article-title>
<source>PLoS Genet</source>
<year>2009</year>
<volume>5</volume>
<issue>3</issue>
<fpage>e1000438</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pgen.1000438</pub-id>
<pub-id pub-id-type="pmid">19325872</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Brunner</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Valouev</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Reddy</surname>
<given-names>TE</given-names>
</name>
<name>
<surname>Neff</surname>
<given-names>NF</given-names>
</name>
<name>
<surname>Anton</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Medina</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chiao</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Oyolu</surname>
<given-names>CB</given-names>
</name>
<name>
<surname>Schroth</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Absher</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
<article-title>Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<issue>6</issue>
<fpage>1044</fpage>
<lpage>1056</lpage>
<pub-id pub-id-type="doi">10.1101/gr.088773.108</pub-id>
<pub-id pub-id-type="pmid">19273619</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Lister</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Pelizzola</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Dowen</surname>
<given-names>RH</given-names>
</name>
<name>
<surname>Hawkins</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Tonti-Filippini</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Nery</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>QM</given-names>
</name>
<name>
<surname>Edsall</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Antosiewicz-Bourget</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Ruotti</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Millar</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Thomson</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ecker</surname>
<given-names>JR</given-names>
</name>
<article-title>Human DNA methylomes at base resolution show widespread epigenomic differences</article-title>
<source>Nature</source>
<year>2009</year>
<volume>462</volume>
<issue>7271</issue>
<fpage>315</fpage>
<lpage>322</lpage>
<pub-id pub-id-type="doi">10.1038/nature08514</pub-id>
<pub-id pub-id-type="pmid">19829295</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Taylor</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Kramer</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Duff</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Caldwell</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>H</given-names>
</name>
<article-title>Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing</article-title>
<source>Cancer Res</source>
<year>2007</year>
<volume>67</volume>
<issue>18</issue>
<fpage>8511</fpage>
<lpage>8518</lpage>
<pub-id pub-id-type="doi">10.1158/0008-5472.CAN-07-1016</pub-id>
<pub-id pub-id-type="pmid">17875690</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="other">
<name>
<surname>Kim</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Paik</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Nephew</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kramer</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>TH</given-names>
</name>
<article-title>Predicting DNA methylation susceptibility using CpG flanking sequences</article-title>
<source>Pac Symp Biocomput</source>
<year>2008</year>
<fpage>315</fpage>
<lpage>326</lpage>
<pub-id pub-id-type="pmid">18229696</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Previti</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Harari</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Zwir</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Val</surname>
<given-names>CD</given-names>
</name>
<article-title>Profile analysis and prediction of tissue-specific CpG island methylation classes</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>116</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-116</pub-id>
<pub-id pub-id-type="pmid">19383127</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
<article-title>Random forests</article-title>
<source>Machine Learning</source>
<year>2001</year>
<volume>45</volume>
<fpage>5</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="other">
<collab>NIPS</collab>
<article-title>Feature selection challenge</article-title>
<year>2003</year>
<ext-link ext-link-type="uri" xlink:href="http://www.nipsfsc.ecs.soton.ac.uk">http://www.nipsfsc.ecs.soton.ac.uk</ext-link>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="book">
<name>
<surname>Cormen</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Leiserson</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Rivest</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>C</given-names>
</name>
<source>Introduction to Algorithms</source>
<year>2003</year>
<edition>2</edition>
<publisher-name>McGraw-Hill Science/Engineering/Math</publisher-name>
<ext-link ext-link-type="uri" xlink:href="http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&amp;path=ASIN/0072970545">http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&amp;path=ASIN/0072970545</ext-link>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="other">
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rohde</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Tierling</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jurkowski</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Bock</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Santacruz</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ragozin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Reinhardt</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Groth</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jeltsch</surname>
<given-names>A</given-names>
</name>
<article-title>amplicon 193</article-title>
<year>2010</year>
<ext-link ext-link-type="uri" xlink:href="http://biochem.jacobs-university.de/name21/presentation/amplicon_summaries/193_amplicon_summary.html">http://biochem.jacobs-university.de/name21/presentation/amplicon_summaries/193_amplicon_summary.html</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
