MERS exploration server

Comparison of discriminative motif optimization using matrix and DNA shape-based models

Internal identifier: 000271 (Pmc/Corpus); previous: 000270; next: 000272

Authors: Shuxiang Ruan; Gary D. Stormo

RBID : PMC:5840810

Abstract

Background

Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site’s activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use k-mers (DNA “words” of length k) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models.

Results

We describe a program “Discriminative Additive Model Optimization” (DAMO) that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare to a recent study where structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM.
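
The objective described above can be illustrated with a small sketch. This is not the DAMO implementation: the toy matrix, the example sequences, the forward-strand-only scan, and the helper names `pwm_score`, `best_window_score` and `auroc` are all assumptions made for illustration. It only shows how an additive PWM scores a sequence and how the AUROC of a set of positive and negative examples can be computed from those scores.

```python
BASES = "ACGT"  # column order assumed for the toy matrix below


def pwm_score(pwm, site):
    """Additive PWM score: sum one matrix element per position of the site."""
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(site))


def best_window_score(pwm, seq):
    """Score a longer sequence by its best-scoring window (forward strand only)."""
    width = len(pwm)
    return max(pwm_score(pwm, seq[i:i + width])
               for i in range(len(seq) - width + 1))


def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical width-3 PWM that favors the site "TGA" (rows = positions).
toy_pwm = [
    [0.0, 0.0, 0.0, 1.0],  # position 1 rewards T
    [0.0, 0.0, 1.0, 0.0],  # position 2 rewards G
    [1.0, 0.0, 0.0, 0.0],  # position 3 rewards A
]

# Made-up stand-ins for ChIP-seq peak sequences and background sequences.
positives = ["CCTGACC", "ATGAGGG", "TTTGATT"]
negatives = ["CCCCCCC", "ACACACA", "GGTGCGG"]

pos_scores = [best_window_score(toy_pwm, s) for s in positives]
neg_scores = [best_window_score(toy_pwm, s) for s in negatives]
toy_auroc = auroc(pos_scores, neg_scores)
print(toy_auroc)
```

An optimizer in this setting would adjust the entries of the matrix to raise the AUROC on the training sequences; the search strategy itself is not specified in the abstract and is left out of the sketch.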

Conclusion

To appropriately compare different models for TF binding sites, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including the incorporation of DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2104-7) contains supplementary material, which is available to authorized users.


DOI: 10.1186/s12859-018-2104-7
PubMed: 29510689
PubMed Central: 5840810

Links to Exploration step

PMC:5840810

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Comparison of discriminative motif optimization using matrix and DNA shape-based models</title>
<author>
<name sortKey="Ruan, Shuxiang" sort="Ruan, Shuxiang" uniqKey="Ruan S" first="Shuxiang" last="Ruan">Shuxiang Ruan</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stormo, Gary D" sort="Stormo, Gary D" uniqKey="Stormo G" first="Gary D." last="Stormo">Gary D. Stormo</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29510689</idno>
<idno type="pmc">5840810</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5840810</idno>
<idno type="RBID">PMC:5840810</idno>
<idno type="doi">10.1186/s12859-018-2104-7</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000271</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000271</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Comparison of discriminative motif optimization using matrix and DNA shape-based models</title>
<author>
<name sortKey="Ruan, Shuxiang" sort="Ruan, Shuxiang" uniqKey="Ruan S" first="Shuxiang" last="Ruan">Shuxiang Ruan</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stormo, Gary D" sort="Stormo, Gary D" uniqKey="Stormo G" first="Gary D." last="Stormo">Gary D. Stormo</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site’s activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use
<italic>k</italic>
-mers (DNA “words” of length
<italic>k</italic>
) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">We describe a program “Discriminative Additive Model Optimization” (DAMO) that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare to a recent study where structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">To appropriately compare different models for TF binding sites, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including the incorporation of DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2104-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Buratowski, S" uniqKey="Buratowski S">S Buratowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcghee, Jd" uniqKey="Mcghee J">JD McGhee</name>
</author>
<author>
<name sortKey="Felsenfeld, G" uniqKey="Felsenfeld G">G Felsenfeld</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, Pa" uniqKey="Jones P">PA Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pabo, Co" uniqKey="Pabo C">CO Pabo</name>
</author>
<author>
<name sortKey="Sauer, Rt" uniqKey="Sauer R">RT Sauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Boer, Cg" uniqKey="De Boer C">CG de Boer</name>
</author>
<author>
<name sortKey="Hughes, Tr" uniqKey="Hughes T">TR Hughes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rhee, Dy" uniqKey="Rhee D">DY Rhee</name>
</author>
<author>
<name sortKey="Cho, Dy" uniqKey="Cho D">DY Cho</name>
</author>
<author>
<name sortKey="Zhai, B" uniqKey="Zhai B">B Zhai</name>
</author>
<author>
<name sortKey="Slattery, M" uniqKey="Slattery M">M Slattery</name>
</author>
<author>
<name sortKey="Ma, L" uniqKey="Ma L">L Ma</name>
</author>
<author>
<name sortKey="Mintseris, J" uniqKey="Mintseris J">J Mintseris</name>
</author>
<author>
<name sortKey="Wong, Cy" uniqKey="Wong C">CY Wong</name>
</author>
<author>
<name sortKey="White, Kp" uniqKey="White K">KP White</name>
</author>
<author>
<name sortKey="Celniker, Se" uniqKey="Celniker S">SE Celniker</name>
</author>
<author>
<name sortKey="Przytycka, Tm" uniqKey="Przytycka T">TM Przytycka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vaquerizas, Jm" uniqKey="Vaquerizas J">JM Vaquerizas</name>
</author>
<author>
<name sortKey="Kummerfeld, Sk" uniqKey="Kummerfeld S">SK Kummerfeld</name>
</author>
<author>
<name sortKey="Teichmann, Sa" uniqKey="Teichmann S">SA Teichmann</name>
</author>
<author>
<name sortKey="Luscombe, Nm" uniqKey="Luscombe N">NM Luscombe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mertin, S" uniqKey="Mertin S">S Mertin</name>
</author>
<author>
<name sortKey="Mcdowall, Sg" uniqKey="Mcdowall S">SG McDowall</name>
</author>
<author>
<name sortKey="Harley, Vr" uniqKey="Harley V">VR Harley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kouzarides, T" uniqKey="Kouzarides T">T Kouzarides</name>
</author>
<author>
<name sortKey="Ziff, E" uniqKey="Ziff E">E Ziff</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hai, T" uniqKey="Hai T">T Hai</name>
</author>
<author>
<name sortKey="Curran, T" uniqKey="Curran T">T Curran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Al Sarraj, A" uniqKey="Al Sarraj A">A Al-Sarraj</name>
</author>
<author>
<name sortKey="Day, Rm" uniqKey="Day R">RM Day</name>
</author>
<author>
<name sortKey="Thiel, G" uniqKey="Thiel G">G Thiel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weirauch, Mt" uniqKey="Weirauch M">MT Weirauch</name>
</author>
<author>
<name sortKey="Yang, A" uniqKey="Yang A">A Yang</name>
</author>
<author>
<name sortKey="Albu, M" uniqKey="Albu M">M Albu</name>
</author>
<author>
<name sortKey="Cote, A" uniqKey="Cote A">A Cote</name>
</author>
<author>
<name sortKey="Montenegro Montero, A" uniqKey="Montenegro Montero A">A Montenegro-Montero</name>
</author>
<author>
<name sortKey="Drewe, P" uniqKey="Drewe P">P Drewe</name>
</author>
<author>
<name sortKey="Najafabadi, Hs" uniqKey="Najafabadi H">HS Najafabadi</name>
</author>
<author>
<name sortKey="Lambert, Sa" uniqKey="Lambert S">SA Lambert</name>
</author>
<author>
<name sortKey="Mann, I" uniqKey="Mann I">I Mann</name>
</author>
<author>
<name sortKey="Cook, K" uniqKey="Cook K">K Cook</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jolma, A" uniqKey="Jolma A">A Jolma</name>
</author>
<author>
<name sortKey="Kivioja, T" uniqKey="Kivioja T">T Kivioja</name>
</author>
<author>
<name sortKey="Toivonen, J" uniqKey="Toivonen J">J Toivonen</name>
</author>
<author>
<name sortKey="Cheng, L" uniqKey="Cheng L">L Cheng</name>
</author>
<author>
<name sortKey="Wei, G" uniqKey="Wei G">G Wei</name>
</author>
<author>
<name sortKey="Enge, M" uniqKey="Enge M">M Enge</name>
</author>
<author>
<name sortKey="Taipale, M" uniqKey="Taipale M">M Taipale</name>
</author>
<author>
<name sortKey="Vaquerizas, Jm" uniqKey="Vaquerizas J">JM Vaquerizas</name>
</author>
<author>
<name sortKey="Yan, J" uniqKey="Yan J">J Yan</name>
</author>
<author>
<name sortKey="Sillanpaa, Mj" uniqKey="Sillanpaa M">MJ Sillanpaa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berger, Mf" uniqKey="Berger M">MF Berger</name>
</author>
<author>
<name sortKey="Philippakis, Aa" uniqKey="Philippakis A">AA Philippakis</name>
</author>
<author>
<name sortKey="Qureshi, Am" uniqKey="Qureshi A">AM Qureshi</name>
</author>
<author>
<name sortKey="He, Fs" uniqKey="He F">FS He</name>
</author>
<author>
<name sortKey="Estep, Pw" uniqKey="Estep P">PW Estep</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Valouev, A" uniqKey="Valouev A">A Valouev</name>
</author>
<author>
<name sortKey="Johnson, Ds" uniqKey="Johnson D">DS Johnson</name>
</author>
<author>
<name sortKey="Sundquist, A" uniqKey="Sundquist A">A Sundquist</name>
</author>
<author>
<name sortKey="Medina, C" uniqKey="Medina C">C Medina</name>
</author>
<author>
<name sortKey="Anton, E" uniqKey="Anton E">E Anton</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
<author>
<name sortKey="Myers, Rm" uniqKey="Myers R">RM Myers</name>
</author>
<author>
<name sortKey="Sidow, A" uniqKey="Sidow A">A Sidow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Martini, Gd" uniqKey="Martini G">GD Martini</name>
</author>
<author>
<name sortKey="Rube, Ht" uniqKey="Rube H">HT Rube</name>
</author>
<author>
<name sortKey="Kribelbauer, Jf" uniqKey="Kribelbauer J">JF Kribelbauer</name>
</author>
<author>
<name sortKey="Rastogi, C" uniqKey="Rastogi C">C Rastogi</name>
</author>
<author>
<name sortKey="Fitzpatrick, Vd" uniqKey="Fitzpatrick V">VD FitzPatrick</name>
</author>
<author>
<name sortKey="Houtman, Jc" uniqKey="Houtman J">JC Houtman</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
<author>
<name sortKey="Pufall, Ma" uniqKey="Pufall M">MA Pufall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foat, Bc" uniqKey="Foat B">BC Foat</name>
</author>
<author>
<name sortKey="Morozov, Av" uniqKey="Morozov A">AV Morozov</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ruan, S" uniqKey="Ruan S">S Ruan</name>
</author>
<author>
<name sortKey="Swamidass, Sj" uniqKey="Swamidass S">SJ Swamidass</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
<author>
<name sortKey="Schneider, Td" uniqKey="Schneider T">TD Schneider</name>
</author>
<author>
<name sortKey="Gold, L" uniqKey="Gold L">L Gold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weirauch, Mt" uniqKey="Weirauch M">MT Weirauch</name>
</author>
<author>
<name sortKey="Cote, A" uniqKey="Cote A">A Cote</name>
</author>
<author>
<name sortKey="Norel, R" uniqKey="Norel R">R Norel</name>
</author>
<author>
<name sortKey="Annala, M" uniqKey="Annala M">M Annala</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Riley, Tr" uniqKey="Riley T">TR Riley</name>
</author>
<author>
<name sortKey="Saez Rodriguez, J" uniqKey="Saez Rodriguez J">J Saez-Rodriguez</name>
</author>
<author>
<name sortKey="Cokelaer, T" uniqKey="Cokelaer T">T Cokelaer</name>
</author>
<author>
<name sortKey="Vedenko, A" uniqKey="Vedenko A">A Vedenko</name>
</author>
<author>
<name sortKey="Talukder, S" uniqKey="Talukder S">S Talukder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benos, Pv" uniqKey="Benos P">PV Benos</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiang, B" uniqKey="Jiang B">B Jiang</name>
</author>
<author>
<name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Ruan, S" uniqKey="Ruan S">S Ruan</name>
</author>
<author>
<name sortKey="Pandey, M" uniqKey="Pandey M">M Pandey</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abe, N" uniqKey="Abe N">N Abe</name>
</author>
<author>
<name sortKey="Dror, I" uniqKey="Dror I">I Dror</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Slattery, M" uniqKey="Slattery M">M Slattery</name>
</author>
<author>
<name sortKey="Zhou, T" uniqKey="Zhou T">T Zhou</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
<author>
<name sortKey="Mann, Rs" uniqKey="Mann R">RS Mann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
<author>
<name sortKey="Jin, X" uniqKey="Jin X">X Jin</name>
</author>
<author>
<name sortKey="West, Sm" uniqKey="West S">SM West</name>
</author>
<author>
<name sortKey="Joshi, R" uniqKey="Joshi R">R Joshi</name>
</author>
<author>
<name sortKey="Honig, B" uniqKey="Honig B">B Honig</name>
</author>
<author>
<name sortKey="Mann, Rs" uniqKey="Mann R">RS Mann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
<author>
<name sortKey="West, Sm" uniqKey="West S">SM West</name>
</author>
<author>
<name sortKey="Sosinsky, A" uniqKey="Sosinsky A">A Sosinsky</name>
</author>
<author>
<name sortKey="Liu, P" uniqKey="Liu P">P Liu</name>
</author>
<author>
<name sortKey="Mann, Rs" uniqKey="Mann R">RS Mann</name>
</author>
<author>
<name sortKey="Honig, B" uniqKey="Honig B">B Honig</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, T" uniqKey="Zhou T">T Zhou</name>
</author>
<author>
<name sortKey="Shen, N" uniqKey="Shen N">N Shen</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Abe, N" uniqKey="Abe N">N Abe</name>
</author>
<author>
<name sortKey="Horton, J" uniqKey="Horton J">J Horton</name>
</author>
<author>
<name sortKey="Mann, Rs" uniqKey="Mann R">RS Mann</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
<author>
<name sortKey="Gordan, R" uniqKey="Gordan R">R Gordan</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, T" uniqKey="Zhou T">T Zhou</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Lu, Y" uniqKey="Lu Y">Y Lu</name>
</author>
<author>
<name sortKey="Dror, I" uniqKey="Dror I">I Dror</name>
</author>
<author>
<name sortKey="Dantas Machado, Ac" uniqKey="Dantas Machado A">AC Dantas Machado</name>
</author>
<author>
<name sortKey="Ghane, T" uniqKey="Ghane T">T Ghane</name>
</author>
<author>
<name sortKey="Di Felice, R" uniqKey="Di Felice R">R Di Felice</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chiu, Tp" uniqKey="Chiu T">TP Chiu</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Zhou, T" uniqKey="Zhou T">T Zhou</name>
</author>
<author>
<name sortKey="Main, Bj" uniqKey="Main B">BJ Main</name>
</author>
<author>
<name sortKey="Parker, Sc" uniqKey="Parker S">SC Parker</name>
</author>
<author>
<name sortKey="Nuzhdin, Sv" uniqKey="Nuzhdin S">SV Nuzhdin</name>
</author>
<author>
<name sortKey="Tullius, Td" uniqKey="Tullius T">TD Tullius</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mathelier, A" uniqKey="Mathelier A">A Mathelier</name>
</author>
<author>
<name sortKey="Xin, B" uniqKey="Xin B">B Xin</name>
</author>
<author>
<name sortKey="Chiu, Tp" uniqKey="Chiu T">TP Chiu</name>
</author>
<author>
<name sortKey="Yang, L" uniqKey="Yang L">L Yang</name>
</author>
<author>
<name sortKey="Rohs, R" uniqKey="Rohs R">R Rohs</name>
</author>
<author>
<name sortKey="Wasserman, Ww" uniqKey="Wasserman W">WW Wasserman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patel, Ry" uniqKey="Patel R">RY Patel</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ruan, S" uniqKey="Ruan S">S Ruan</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Consortium, Ep" uniqKey="Consortium E">EP Consortium</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, Wj" uniqKey="Kent W">WJ Kent</name>
</author>
<author>
<name sortKey="Sugnet, Cw" uniqKey="Sugnet C">CW Sugnet</name>
</author>
<author>
<name sortKey="Furey, Ts" uniqKey="Furey T">TS Furey</name>
</author>
<author>
<name sortKey="Roskin, Km" uniqKey="Roskin K">KM Roskin</name>
</author>
<author>
<name sortKey="Pringle, Th" uniqKey="Pringle T">TH Pringle</name>
</author>
<author>
<name sortKey="Zahler, Am" uniqKey="Zahler A">AM Zahler</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Spiro, C" uniqKey="Spiro C">C Spiro</name>
</author>
<author>
<name sortKey="Bazett Jones, Dp" uniqKey="Bazett Jones D">DP Bazett-Jones</name>
</author>
<author>
<name sortKey="Wu, X" uniqKey="Wu X">X Wu</name>
</author>
<author>
<name sortKey="Mcmurray, Ct" uniqKey="Mcmurray C">CT McMurray</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Orenstein, Y" uniqKey="Orenstein Y">Y Orenstein</name>
</author>
<author>
<name sortKey="Shamir, R" uniqKey="Shamir R">R Shamir</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29510689</article-id>
<article-id pub-id-type="pmc">5840810</article-id>
<article-id pub-id-type="publisher-id">2104</article-id>
<article-id pub-id-type="doi">10.1186/s12859-018-2104-7</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Comparison of discriminative motif optimization using matrix and DNA shape-based models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Ruan</surname>
<given-names>Shuxiang</given-names>
</name>
<address>
<email>sruan@wustl.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0001-6896-1850</contrib-id>
<name>
<surname>Stormo</surname>
<given-names>Gary D.</given-names>
</name>
<address>
<email>stormo@wustl.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2355 7002</institution-id>
<institution-id institution-id-type="GRID">grid.4367.6</institution-id>
<institution>Department of Genetics and Edison Family Center for Genome Sciences and Systems Biology,</institution>
<institution>Washington University School of Medicine,</institution>
</institution-wrap>
St. Louis, 63110 USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>6</day>
<month>3</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>6</day>
<month>3</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<elocation-id>86</elocation-id>
<history>
<date date-type="received">
<day>9</day>
<month>10</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>1</day>
<month>3</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site’s activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use
<italic>k</italic>
-mers (DNA “words” of length
<italic>k</italic>
) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">We describe a program “Discriminative Additive Model Optimization” (DAMO) that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare to a recent study where structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">To appropriately compare different models for TF binding sites, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including the incorporation of DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2104-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Motif</kwd>
<kwd>Motif optimization</kwd>
<kwd>ChIP-seq</kwd>
<kwd>Position weight matrix</kwd>
<kwd>DNA shape features</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100000051</institution-id>
<institution>National Human Genome Research Institute</institution>
</institution-wrap>
</funding-source>
<award-id>HG000249</award-id>
<award-id>T32 HG000045</award-id>
<principal-award-recipient>
<name>
<surname>Ruan</surname>
<given-names>Shuxiang</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par4">The interaction between proteins and genomic DNA plays a crucial role in many important cellular processes. For instance, the RNA polymerase interacts with DNA during transcription and uses it as a template for RNA synthesis [
<xref ref-type="bibr" rid="CR1">1</xref>
] and the formation of nucleosomes involves histones and DNA binding together to form a well-defined three-dimensional structure [
<xref ref-type="bibr" rid="CR2">2</xref>
]. Some epigenetic modifications such as DNA methylation, which alter DNA accessibility and chromatin structures, are carried out by the DNA methyltransferase and other proteins that mainly target CpG di-nucleotides [
<xref ref-type="bibr" rid="CR3">3</xref>
]. The sequence-specific transcription factors (TFs) are a special class of DNA-binding proteins that recognize specific DNA sequences and primarily regulate gene expression [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
]. In most species, they constitute between 5% and 10% of all genes [
<xref ref-type="bibr" rid="CR6">6</xref>
–
<xref ref-type="bibr" rid="CR8">8</xref>
]. Although some prominent TFs, including Sox [
<xref ref-type="bibr" rid="CR9">9</xref>
], AP-1 [
<xref ref-type="bibr" rid="CR10">10</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
] and Sp1 [
<xref ref-type="bibr" rid="CR12">12</xref>
], have been studied extensively, the binding specificities of most TFs are poorly documented even in many well-studied species [
<xref ref-type="bibr" rid="CR13">13</xref>
]. In recent years, several high-throughput experimental techniques, such as high-throughput SELEX (HT-SELEX), protein-binding microarrays (PBMs) and ChIP-seq, have been developed to estimate the relative binding affinities of large numbers of DNA sequences both in vitro and in vivo [
<xref ref-type="bibr" rid="CR14">14</xref>
–
<xref ref-type="bibr" rid="CR17">17</xref>
]. These techniques have greatly accelerated the study of TF binding specificity [
<xref ref-type="bibr" rid="CR4">4</xref>
], but the analysis of their results proves challenging and requires the development of novel TF binding models and motif discovery algorithms.</p>
<p id="Par5">The specificity of TFs is commonly represented by matrix models, of which there are several varieties [
<xref ref-type="bibr" rid="CR18">18</xref>
]. In probabilistic models, such as position frequency matrices (PFMs), the matrix elements are the probabilities of each base occurring at each position in the binding site, and the probability of a specific site is the product of the probabilities of its bases across positions. In a more general position weight matrix (PWM), the matrix elements corresponding to the bases of a site are added together to get the score for that site. When quantitative binding data are available, regression methods can be used to obtain matrix elements that correspond to the energy contributions of each base at each position [
<xref ref-type="bibr" rid="CR19">19</xref>
–
<xref ref-type="bibr" rid="CR21">21</xref>
]. All matrix models have in common the assumption that the positions of a binding site contribute independently to its activity, an assumption that is often a good approximation but not always [
<xref ref-type="bibr" rid="CR22">22</xref>
–
<xref ref-type="bibr" rid="CR24">24</xref>
]. More complex models utilize
<italic>k</italic>
-mers, short DNA words of length
<italic>k</italic>
, to account for non-independence between positions [
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
,
<xref ref-type="bibr" rid="CR25">25</xref>
–
<xref ref-type="bibr" rid="CR27">27</xref>
]. Recently there have been several studies showing that variations in DNA shape can influence TF binding affinity, and that those contributions may involve non-independence between positions [
<xref ref-type="bibr" rid="CR28">28</xref>
–
<xref ref-type="bibr" rid="CR31">31</xref>
]. DNAshape is a program that predicts DNA structural features in a high-throughput manner based on Monte Carlo simulations of DNA fragments [
<xref ref-type="bibr" rid="CR32">32</xref>
]. The Genome Browser for DNA shape annotations (GBshape), a database based on DNAshape and related computational tools, provides DNA shape feature predictions for a range of organisms [
<xref ref-type="bibr" rid="CR33">33</xref>
]. Those resources were used in a recent study where motif models using gradient boosting classifiers were trained to differentiate ChIP-seq peaks from random background sequences, showing that adding DNA shape features can significantly improve the accuracy of the classifiers [
<xref ref-type="bibr" rid="CR34">34</xref>
].</p>
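<p>The difference between the two scoring schemes can be made concrete with a minimal Python sketch. The three-position matrix, the uniform background of 0.25 and the function names below are illustrative assumptions, not values or code from this study. The log-odds conversion is only one common way to obtain additive weights, and a PWM in general need not be derived from probabilities at all.</p>
<preformat>
```python
import math

BASES = "ACGT"

# Hypothetical 3-position PFM: one row per position, probabilities for A, C, G, T.
PFM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]


def pfm_probability(pfm, site):
    """Probability of a site: product of the per-position base probabilities."""
    prob = 1.0
    for row, base in zip(pfm, site):
        prob *= row[BASES.index(base)]
    return prob


def pfm_to_log_odds(pfm, background=0.25):
    """Assumed conversion to additive weights: log2(p / background)."""
    return [[math.log2(p / background) for p in row] for row in pfm]


def pwm_score(pwm, site):
    """Additive score of a site: sum of the matrix elements for its bases."""
    return sum(row[BASES.index(base)] for row, base in zip(pwm, site))


pwm = pfm_to_log_odds(PFM)
print(pfm_probability(PFM, "AGT"))  # product of 0.7, 0.7 and 0.25
print(pwm_score(pwm, "AGT"))        # sum of the three log-odds weights
```
</preformat>
<p>Because the logarithm turns a product into a sum, the site with the highest PFM probability in this sketch also has the highest log-odds score.</p>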
<p id="Par6">In this report, we replicate the results of Mathelier et al. [
<xref ref-type="bibr" rid="CR34">34</xref>
] and we compare the performance of the gradient boosting classifiers to simple PWMs generated by DAMO, a perceptron-based optimization method that finds the optimal PWM with the highest area under the receiver operating characteristic curve (AUROC). DAMO is similar to our previously described DiMO [
<xref ref-type="bibr" rid="CR35">35</xref>
], but where DiMO provided optimized PFMs, DAMO provides optimized PWMs, which have recently been shown to avoid the inherent limitations of probabilistic models [
<xref ref-type="bibr" rid="CR36">36</xref>
]. DAMO also allows for the inclusion of adjacent di-nucleotides if the independence assumption provides poor performance. Our results confirm that adding DNA shape features in a gradient boosting classifier does significantly improve the performance over the initial JASPAR PFMs, but also show that most of the improvement can be obtained with optimal PWMs, and adding di-nucleotide contributions performs nearly as well as the much more complex gradient boosting classifiers including shape parameters.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>JASPAR PFMs</title>
<p id="Par7">Following the study of Mathelier et al. [
<xref ref-type="bibr" rid="CR34">34</xref>
], we obtained 75 JASPAR PFMs that can be associated with ChIP-seq datasets generated by the ENCODE project [
<xref ref-type="bibr" rid="CR37">37</xref>
]. (Their work included 76 PFMs, but one of those, MA0133.1, is no longer available in the March 2017 JASPAR CORE collection.)</p>
</sec>
<sec id="Sec4">
<title>ChIP-seq datasets</title>
<p id="Par8">We used the same ChIP-seq datasets analyzed by Mathelier et al. [
<xref ref-type="bibr" rid="CR34">34</xref>
] and downloaded 396 uniformly processed human ENCODE ChIP-seq datasets associated with the 75 JASPAR PFMs from the UCSC Genome Browser [
<xref ref-type="bibr" rid="CR38">38</xref>
]. For each ChIP-seq peak, we retrieved the 100 bp sequence centered on the point-source of the peak from the human genome assembly hg19, which serves as a positive sequence for training and testing TF binding models. For each positive sequence, we also constructed a negative sequence, which is the 100 bp sequence 100 bp downstream from the positive sequence in the human genome. We also tested performance when the negative sequences were obtained from 5000 bp downstream. For each ChIP-seq dataset, we constructed 10 training and 10 testing sets for 10-fold cross-validation, where each training set is 9 times the size of a testing set. We also tested performance when the training set and testing set were each only 10% of the total data.</p>
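<p>The window construction and the cross-validation split described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: coordinates are treated as 0-based half-open intervals on the forward strand, the downstream offset is measured from the end of the positive window, and folds are assigned by index modulo k instead of by random shuffling. All function names are hypothetical.</p>
<preformat>
```python
def positive_window(summit, width=100):
    """Return (start, end) of the window of `width` bp centered on the summit."""
    start = summit - width // 2
    return start, start + width


def negative_window(positive, offset=100):
    """Return a window of equal width lying `offset` bp past the positive one."""
    start, end = positive
    neg_start = end + offset
    return neg_start, neg_start + (end - start)


def k_fold_splits(n_items, k=10):
    """Yield (train, test) index lists so that every item is tested once."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


pos = positive_window(1000)                  # hypothetical summit coordinate
neg_near = negative_window(pos)              # 100 bp downstream
neg_far = negative_window(pos, offset=5000)  # 5000 bp downstream
splits = list(k_fold_splits(50))
print(pos, neg_near, neg_far, len(splits[0][0]), len(splits[0][1]))
```
</preformat>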
</sec>
<sec id="Sec5">
<title>DNA shape features</title>
<p id="Par9">We retrieved the same DNA shape features used by Mathelier et al. [
<xref ref-type="bibr" rid="CR34">34</xref>
] from GBshape [
<xref ref-type="bibr" rid="CR33">33</xref>
]. The features include the helix twist (HelT), the minor groove width (MGW), the propeller twist (ProT), the roll (Roll), and the corresponding second-order shape features. These features were only used for training and testing models designated with “+ shape”.</p>
</sec>
<sec id="Sec6">
<title>Motif optimization algorithms evaluated</title>
<p id="Par10">Table 
<xref rid="Tab1" ref-type="table">1</xref>
lists the motif optimization algorithms evaluated in this study. Of the 9 algorithms, 5 are based on the DNAshapedTFBS program, which trains a binary classifier using the gradient boosting algorithm [
<xref ref-type="bibr" rid="CR34">34</xref>
]. These 5 algorithms differ in two aspects: 1) how the feature vector is encoded, and 2) whether the feature vector includes DNA shape features. In the 4bit encoding, which was used by Mathelier et al., A is encoded as 1000, T as 0100, G as 0010, and C as 0001. In JASPAR + shape and DAMO + shape the sequence information is included simply as a score from a matrix model, and in Shape_only it is not included at all. The remaining 4 algorithms are simple matrix models: the JASPAR PFMs, the PWMs from the single-nucleotide and adjacent di-nucleotide modes of the DAMO program, and the PFMs derived from the DAMO PWMs (which are equivalent to the original DiMO PFMs). DAMO is a Python implementation of the DiMO program, which is based on perceptron learning and finds the optimal PWM by maximizing its AUROC score [
<xref ref-type="bibr" rid="CR35">35</xref>
]. The original DiMO program outputs a normalized PFM derived from the optimal PWM, even though it uses a PWM internally for optimization. Because PWMs do not have the limitations of probabilistic PFMs [
<xref ref-type="bibr" rid="CR36">36</xref>
], we configured the DAMO program to output the optimal PWM directly and we also allow for adjacent di-nucleotides to be included to capture non-independent contributions from adjacent positions in the binding sites [
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR27">27</xref>
].
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Descriptions of the motif optimization algorithms evaluated</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Algorithm</th>
<th>Output</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>JASPAR</td>
<td>Position frequency matrix</td>
<td>PFMs from the JASPAR database</td>
</tr>
<tr>
<td>DAMO</td>
<td>Position weight matrix</td>
<td>Modified DiMO program that outputs PWMs instead of PFMs</td>
</tr>
<tr>
<td>DAMO_PFM</td>
<td>Position frequency matrix</td>
<td>PFMs derived from the DAMO single-nucleotide PWMs</td>
</tr>
<tr>
<td>DAMO_dinuc</td>
<td>Position weight matrix</td>
<td>The adjacent di-nucleotide mode of DAMO</td>
</tr>
<tr>
<td>DNAshapedTFBS_4bit</td>
<td>Gradient boosting classifier</td>
<td>DNAshapedTFBS with 4-bit encoding</td>
</tr>
<tr>
<td>DNAshapedTFBS_4bit + shape</td>
<td>Gradient boosting classifier</td>
<td>DNAshapedTFBS_4bit plus DNA shape features</td>
</tr>
<tr>
<td>Shape_only</td>
<td>Gradient boosting classifier</td>
<td>The feature vector contains only DNA shape features</td>
</tr>
<tr>
<td>JASPAR + shape</td>
<td>Gradient boosting classifier</td>
<td>JASPAR PFM score plus DNA shape features</td>
</tr>
<tr>
<td>DAMO + shape</td>
<td>Gradient boosting classifier</td>
<td>DAMO single-nucleotide PWM score plus DNA shape features</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
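<p>As an illustration of the 4bit encoding described above, the following Python sketch maps each base to its four-bit code and concatenates the codes over the site. The optional shape values appended at the end are placeholders, and the function names are hypothetical; the exact layout of the feature vector inside DNAshapedTFBS is not reproduced here.</p>
<preformat>
```python
# Codes as given in the text: A = 1000, T = 0100, G = 0010, C = 0001.
FOUR_BIT = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "C": (0, 0, 0, 1),
}


def encode_4bit(site):
    """Concatenate the 4-bit codes of all bases in the site."""
    vector = []
    for base in site.upper():
        vector.extend(FOUR_BIT[base])
    return vector


def feature_vector(site, shape_values=()):
    """Sequence bits followed by optional (already normalized) shape values."""
    return encode_4bit(site) + list(shape_values)


print(encode_4bit("ATGC"))
print(feature_vector("AT", shape_values=(0.5, 0.25)))  # made-up shape values
```
</preformat>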
</sec>
<sec id="Sec7">
<title>Training and testing binding models</title>
<p id="Par11">The training and testing procedures are based on the methods described by Mathelier et al. [
<xref ref-type="bibr" rid="CR34">34</xref>
].</p>
<sec id="Sec8">
<title>Preprocessing</title>
<p id="Par12">We first use the JASPAR PFM to scan all the positive and negative sequences in both the training and testing set, and identify the best binding site, which has the same length as the PFM, within each sequence. Then we use the DNAshapedTFBS program to extract the DNA sequence of each best site and the corresponding DNA shape features. The sequences of the best sites, instead of the full-length positive and negative sequences, are used in the following steps for training and testing TF binding models. This means that we are not testing motif discovery algorithms, because the positive and negative sites are predefined. Rather, we are testing how well models of different complexity can perform classification after optimization for that task (except for the original JASPAR PFMs).</p>
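<p>The best-site scan can be sketched as a sliding window over each sequence. The sketch below assumes an additive matrix (for a PFM, the logarithms of its probabilities could be used) and scans the forward strand only; whether the original pipeline also scans the reverse complement is not stated here, so that choice is an assumption. The two-position matrix and the function names are illustrative.</p>
<preformat>
```python
BASES = "ACGT"


def site_score(matrix, site):
    """Additive score: one matrix element per position, selected by the base."""
    return sum(row[BASES.index(base)] for row, base in zip(matrix, site))


def best_site(matrix, sequence):
    """Return (score, start, site) for the highest-scoring window.

    Ties are resolved in favor of the leftmost window.
    """
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        score = site_score(matrix, site)
        if best is None or score > best[0]:
            best = (score, start, site)
    return best


# Hypothetical 2-position matrix that favors A followed by G.
TOY_MATRIX = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
print(best_site(TOY_MATRIX, "CCAGTT"))
```
</preformat>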
</sec>
<sec id="Sec9">
<title>Training</title>
<p id="Par13">The training procedure depends on the motif optimization algorithm. For the DNAshapedTFBS-based methods, we first construct, for each best site in the training set, a feature vector containing its JASPAR PFM score, its DAMO PWM score, or its encoded DNA sequence, depending on the method. If the method takes account of the DNA shape features, the feature vector also contains the normalized values of the 8 DNA shape features at each position. Then a gradient boosting classifier is trained on the positive and negative feature vectors. For DAMO, the sequences of the positive and negative sites are directly fed into the program along with the JASPAR PFM, which serves as a seed matrix. DAMO then finds the optimal PWM that maximizes the AUROC on the training set, using the perceptron training algorithm. The perceptron training, described in detail previously [
<xref ref-type="bibr" rid="CR35">35</xref>
], updates the PWM by error correction on the mis-classified sites (positive sites that score lower than the best negative site, and negative sites that score higher than the lowest-scoring positive site), and training proceeds until convergence. The sequences can be encoded using adjacent dinucleotides to capture non-independent contributions between those positions [
<xref ref-type="bibr" rid="CR18">18</xref>
,
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR27">27</xref>
]. The DAMO_PFM model is obtained by considering the DAMO PWM scores as energies and converting them to normalized probabilities (as in the original DiMO approach [
<xref ref-type="bibr" rid="CR35">35</xref>
]).</p>
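<p>The error-correction idea can be illustrated with a toy perceptron in Python. This is a simplified sketch and not the DAMO implementation: the learning rate, the treatment of ties as mis-classifications, the all-zero seed matrix and the stopping rule are assumptions made for the example, and the toy sites are invented.</p>
<preformat>
```python
BASES = "ACGT"


def score(pwm, site):
    """Additive PWM score of a site."""
    return sum(row[BASES.index(base)] for row, base in zip(pwm, site))


def adjust(pwm, site, delta):
    """Add `delta` to the matrix element of every base in the site."""
    for row, base in zip(pwm, site):
        row[BASES.index(base)] += delta


def perceptron_epoch(pwm, positives, negatives, rate=0.1):
    """One error-correction pass; returns how many sites triggered an update."""
    pos_scores = [score(pwm, s) for s in positives]
    neg_scores = [score(pwm, s) for s in negatives]
    best_negative = max(neg_scores)
    lowest_positive = min(pos_scores)
    updates = 0
    for site, value in zip(positives, pos_scores):
        if best_negative >= value:      # positive not above the best negative
            adjust(pwm, site, rate)
            updates += 1
    for site, value in zip(negatives, neg_scores):
        if value >= lowest_positive:    # negative not below the lowest positive
            adjust(pwm, site, -rate)
            updates += 1
    return updates


def train(seed, positives, negatives, rate=0.1, max_epochs=100):
    """Repeat passes until no site is mis-classified or the epoch cap is hit."""
    pwm = [list(row) for row in seed]   # work on a copy of the seed matrix
    for _ in range(max_epochs):
        if perceptron_epoch(pwm, positives, negatives, rate) == 0:
            break
    return pwm


toy_positives = ["AC", "AG"]
toy_negatives = ["TT", "CT"]
trained = train([[0.0] * 4, [0.0] * 4], toy_positives, toy_negatives)
print(trained)
```
</preformat>
<p>On this separable toy set the loop stops once every positive site scores above every negative site. On real ChIP-seq data perfect separation is not expected, which is why the sketch also caps the number of passes.</p>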
</sec>
<sec id="Sec10">
<title>Testing</title>
<p id="Par14">The testing procedure is the same for all the algorithms. The trained gradient boosting classifiers and the different PFM and PWM models are used to score all the positive and negative sites in the testing set. Those scores are used to rank the sites and compute the area under the precision recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) based on the true labels of the ranked sites. We report the mean values and standard deviations from ten-fold cross-validation tests.</p>
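<p>Both rank-based metrics can be computed directly from the scores of the positive and negative sites. The pure-Python sketch below computes the AUROC as the fraction of positive-negative pairs ranked correctly (ties count one half) and approximates the AUPRC by average precision, which is one common estimator and is assumed here; the study's own tooling may use a different interpolation. The example scores are invented.</p>
<preformat>
```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive ranks higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def average_precision(pos_scores, neg_scores):
    """Mean of the precision values measured at the rank of each positive.

    Tied scores keep their input order, so positives are ranked ahead of
    negatives with the same score.
    """
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda item: item[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(labeled, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / len(pos_scores)


example_pos = [0.9, 0.8, 0.4]
example_neg = [0.7, 0.3, 0.2]
print(auroc(example_pos, example_neg))
print(average_precision(example_pos, example_neg))
```
</preformat>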
</sec>
</sec>
</sec>
<sec id="Sec11">
<title>Results</title>
<p id="Par15">For each algorithm, the mean and standard deviation of the AUPRC scores for the 396 samples, on both training and testing sets, are summarized in Table
<xref rid="Tab2" ref-type="table">2</xref>
. (AUROC scores are reported in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1.) As reported previously, adding shape parameters to the JASPAR PFMs significantly improves the AUPRC [
<xref ref-type="bibr" rid="CR34">34</xref>
]. We obtain a mean increase of 0.034 on the testing sets, the same as that obtained by adding shape parameters to the DAMO scores, and larger than the improvement of any other model. But just optimizing the PFM, with the DAMO_PFM model, captures about 35% of the total improvement. The PWM obtained by DAMO provides nearly 60% of the total improvement and demonstrates the inherent advantage of a PWM model over a PFM model, as we showed previously [
<xref ref-type="bibr" rid="CR36">36</xref>
]. Adding parameters for adjacent di-nucleotides captures nearly 80% of the improvement over the JASPAR PFM model. All of those models are simple matrix models where the positions contribute independently to the total score of a site, except that in the DAMO_dinuc model the adjacent dinucleotides contribute additively to the score. The gradient boosting classifiers, which use an ensemble of decision trees for classification, are complex non-linear models even without the shape parameters. The 4bit model, whose input is only the sequence, captures 88% of the improvement achieved by the best model, and adding shape parameters to the 4bit classifier brings its performance to essentially the same level as JASPAR + shape and DAMO + shape. Interestingly, the Shape_only model does nearly as well as any other gradient boosting classifier model, indicating that the shape parameters inherently contain the sequence information (see Discussion). We also tested performance when the negative sequence set was selected at a distance of 5000 bp instead of 100 bp downstream (Additional file
<xref rid="MOESM2" ref-type="media">2</xref>
: Table S2). In that case the performance on both the training and testing sets was increased, probably because sequences in the negative set that are only 100 bp downstream of the ChIP-seq peak may also contain true binding sites. But the overall results are consistent with the initial findings. While the increase in AUPRC of JASPAR + shape over JASPAR PFMs is now 0.065, nearly 70% of that increase is captured using the DAMO PWMs, and nearly 90% is captured by the DAMO_dinuc model.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Mean AUPRC (and standard deviation) on ChIP-seq data</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Algorithm</th>
<th>Training</th>
<th>Testing</th>
</tr>
</thead>
<tbody>
<tr>
<td>JASPAR</td>
<td>0.812 (0.132)</td>
<td>0.812 (0.132)</td>
</tr>
<tr>
<td>DAMO</td>
<td>0.834 (0.119)</td>
<td>0.832 (0.120)</td>
</tr>
<tr>
<td>DAMO_PFM</td>
<td>0.825 (0.120)</td>
<td>0.824 (0.122)</td>
</tr>
<tr>
<td>DAMO_dinuc</td>
<td>0.844 (0.114)</td>
<td>0.839 (0.119)</td>
</tr>
<tr>
<td>DNAshapedTFBS_4bit</td>
<td>0.854 (0.105)</td>
<td>0.842 (0.115)</td>
</tr>
<tr>
<td>DNAshapedTFBS_4bit + shape</td>
<td>0.875 (0.090)</td>
<td>0.845 (0.113)</td>
</tr>
<tr>
<td>Shape_only</td>
<td>0.871 (0.089)</td>
<td>0.840 (0.112)</td>
</tr>
<tr>
<td>JASPAR + shape</td>
<td>0.878 (0.089)</td>
<td>0.846 (0.112)</td>
</tr>
<tr>
<td>DAMO + shape</td>
<td>0.879 (0.090)</td>
<td>0.846 (0.113)</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
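<p>The percentages quoted above follow from the testing column of Table 2. As a worked check, the short Python sketch below expresses each model's gain over the JASPAR PFMs as a fraction of the gain achieved by JASPAR + shape. The numbers are copied from Table 2; the dictionary and function names are illustrative.</p>
<preformat>
```python
# Mean testing AUPRC values copied from Table 2.
TEST_AUPRC = {
    "JASPAR": 0.812,
    "DAMO_PFM": 0.824,
    "DAMO": 0.832,
    "DAMO_dinuc": 0.839,
    "DNAshapedTFBS_4bit": 0.842,
    "JASPAR + shape": 0.846,
}


def fraction_of_gain(model, baseline="JASPAR", best="JASPAR + shape"):
    """Share of the baseline-to-best improvement that `model` captures."""
    total_gain = TEST_AUPRC[best] - TEST_AUPRC[baseline]
    return (TEST_AUPRC[model] - TEST_AUPRC[baseline]) / total_gain


for name in ("DAMO_PFM", "DAMO", "DAMO_dinuc", "DNAshapedTFBS_4bit"):
    print(name, round(100 * fraction_of_gain(name)), "percent")
```
</preformat>
<p>For example, DAMO_PFM gains 0.824 - 0.812 = 0.012 out of a total of 0.846 - 0.812 = 0.034, which is about 35%.</p>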
<p id="Par16">The gap between the training and testing scores is a measure of overfitting for a model, which generally corresponds to the complexity of the model. Specifically, the highly non-linear DNAshapedTFBS-based models, with their ensemble of decision trees, have larger gaps than the PFM and PWM models, which are linear models (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
). In fact, the training and testing scores of the linear models are nearly identical. The largest gaps, > 0.03, are associated with the DNAshapedTFBS-based models with DNA shape features, presumably because their feature vectors are most complex. This effect also shows up in the sensitivity of the complex models to the size of the training data. Additional file
<xref rid="MOESM3" ref-type="media">3</xref>
: Table S3 shows results for several of the models when trained on only 1/10 of the data, the same as the testing sample size, and much smaller than the normal training on 9/10 of the data. All of the models, except for the JASPAR PFMs which are untrained, increase the AUPRC and AUROC scores on the training data and decrease those scores on the testing data. On those small training sets the DAMO PWMs score as well as the more complex models on the testing sets.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Differences in AUPRC between training and testing datasets. For each model the differences are shown for each of the 396 datasets. The box represents the 1st, 2nd (median, indicated with a line) and 3rd quartiles, and the whiskers extend 1.5 times the interquartile range (IQR) below the 1st quartile or above the 3rd quartile</p>
</caption>
<graphic xlink:href="12859_2018_2104_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
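<p>The overfitting gap can also be read off the mean values in Table 2. The Python sketch below subtracts the mean testing AUPRC from the mean training AUPRC for each model. Note that these are gaps between rounded means, not the per-dataset differences plotted in Fig. 1, and the variable names are illustrative.</p>
<preformat>
```python
# Mean AUPRC (training, testing) copied from Table 2.
TABLE2 = {
    "JASPAR": (0.812, 0.812),
    "DAMO": (0.834, 0.832),
    "DAMO_PFM": (0.825, 0.824),
    "DAMO_dinuc": (0.844, 0.839),
    "DNAshapedTFBS_4bit": (0.854, 0.842),
    "DNAshapedTFBS_4bit + shape": (0.875, 0.845),
    "Shape_only": (0.871, 0.840),
    "JASPAR + shape": (0.878, 0.846),
    "DAMO + shape": (0.879, 0.846),
}
LINEAR_MODELS = ("JASPAR", "DAMO", "DAMO_PFM", "DAMO_dinuc")

# Training minus testing, rounded to the precision of the table.
gaps = {name: round(train - test, 3) for name, (train, test) in TABLE2.items()}

for name in sorted(gaps, key=gaps.get):
    print(name, gaps[name])
```
</preformat>
<p>In this calculation the linear matrix models differ by at most 0.005 between training and testing, whereas the classifiers that include shape features differ by 0.030 to 0.033.</p>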
<p id="Par17">Figure
<xref rid="Fig2" ref-type="fig">2</xref>
compares graphically the results reported in Table
<xref rid="Tab2" ref-type="table">2</xref>
, with points for each of the 396 ChIP-seq datasets. The top eight panels have the JASPAR + shape AUPRC results on the vertical axis and each of the other eight models on the horizontal axis. Consistent with Table
<xref rid="Tab2" ref-type="table">2</xref>
, there is a large improvement from the JASPAR scores alone (Fig.
<xref rid="Fig2" ref-type="fig">2a</xref>
). For the DAMO PWMs (Fig.
<xref rid="Fig2" ref-type="fig">2b</xref>
) there are many fewer data sets with large improvement. The DAMO PFMs are much better than the JASPAR PFMs, as expected because those JASPAR PFMs have not been optimized for this task, but are not as good as the DAMO PWMs, showing the inherent limitations of PFM models [
<xref ref-type="bibr" rid="CR36">36</xref>
]. The DAMO_dinuc model (Fig.
<xref rid="Fig2" ref-type="fig">2d</xref>
) has very few datasets with large improvements. In each of the other models, which come from the gradient boosting classifier (Fig.
<xref rid="Fig2" ref-type="fig">2e</xref>
-
<xref rid="Fig2" ref-type="fig">h</xref>
), the data points cluster near the diagonal, indicating that the difference between the two scores of the same sample is very small. The bottom row of panels in Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
shows the score differences, in ascending order, between similar models. Note the differences in scale on the vertical axes. In most cases there are very few datasets with differences > 0.02, except for the comparison of the JASPAR scores and JASPAR + shape, where many datasets show improvements > 0.05 and a few are > 0.10. Most TFs have multiple associated ChIP-seq data sets (median of three), and the mean differences for every TF are shown in Additional file
<xref rid="MOESM4" ref-type="media">4</xref>
: Table S4. Except for the comparison of JASPAR + shape with JASPAR, very few of the TFs have mean differences > 0.02, suggesting that feature vectors based only on sequence, optimized for AUROC scores but without including structure parameters, capture essentially all of the discriminatory power of the motifs. However, the motifs using the ensemble of decision trees do contain higher-order information beyond that available to simple PWMs, most of which is captured by using di-nucleotide extensions of PWMs.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Comparison of AUPRC scores for different models.
<bold>a</bold>
-
<bold>h</bold>
JASPAR + shape on vertical axis and each of the other eight models on the horizontal axis.
<bold>i</bold>
Difference in AUPRC for DAMO PWM with and without di-nucleotides.
<bold>j</bold>
Difference in AUPRC for DAMO PWM and DAMO PFM.
<bold>k</bold>
-
<bold>l</bold>
Differences with adding shape features to the 4bit model and the JASPAR PFM model</p>
</caption>
<graphic xlink:href="12859_2018_2104_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
</sec>
<sec id="Sec12">
<title>Discussion</title>
<p id="Par18">Our results confirm that adding DNA shape features significantly improves the performance of JASPAR PFMs, with a mean increase of 0.034 in AUPRC. Simply optimizing the PFMs for the task of maximizing AUROC captures over one-third of that difference. An optimized PWM captures most of the improvement, and adding di-nucleotide parameters helps further. The gradient boosting approach increased AUPRC slightly more, as did adding shape parameters, but on the vast majority of datasets the differences between the simple PWM models and more complex models are small, consistent with previous work showing that optimized PWMs are often good approximations for TF specificity [
<xref ref-type="bibr" rid="CR22">22</xref>
–
<xref ref-type="bibr" rid="CR24">24</xref>
]. Including DNA shape features further increases the number of parameters in a binding model, which increases the cost of training and may result in overfitting. The fact that the performance of DAMO_dinuc is similar to the non-linear gradient boosting classifiers indicates that the majority of the deviations from the assumption of position independence can be captured by adjacent di-nucleotide interactions.</p>
<p id="Par19">The success of the PWMs does not mean that the structure of DNA plays no role in binding site recognition. In fact, there are good examples showing that it does [
<xref ref-type="bibr" rid="CR28">28</xref>
–
<xref ref-type="bibr" rid="CR30">30</xref>
,
<xref ref-type="bibr" rid="CR39">39</xref>
]. All of the models based on sequence features alone are agnostic with respect to the mechanisms of specificity. They only describe mathematically how much each base at each position contributes to binding specificity, or in the case of higher-order contributions, how useful those are in discriminating the positive and negative training sets. Because DNA structure depends on sequence, redundancies arise when using both types of parameters together. In fact, given a sufficiently long sequence (such as a genome) encoded solely with structure parameters, a good compression algorithm could reconstruct the sequence exactly, demonstrating that the structure information contains within it the sequence information. This is also clear from our results with the Shape_only model. Certainly interactions between the TF and the bases of the DNA sequence are the primary contributions to binding affinity. But encoding the sequence using only structural parameters performs nearly as well as using input vectors including both sequence and structure because the sequence is redundant given the structure.</p>
<p id="Par20">We advocate using the most efficient algorithm, with the fewest parameters, that obtains the maximum fit to quantitative data, or the optimal discrimination between positive and negative data sets. This reduces the complexity of the model to only the non-redundant parameters, minimizes the training time and reduces the susceptibility to over-fitting. Those optimal parameters, including higher-order interactions as needed, can be used to infer the mechanism of binding. For example, if dinucleotides are required to obtain the best fit, and the specific dinucleotides that correspond to higher affinity (or better discrimination) are those correlated with a narrow minor groove, then one could infer that the TF prefers binding to DNA structures with narrow minor grooves. But doing this after the mathematically optimal parameters are obtained removes redundancies in the feature vectors used for training, which could confound interpretation.</p>
<p id="Par21">Discrimination of binding sites from ChIP-seq data, such as with AUPRC or AUROC scores, is a popular method for assessing the accuracy of TF motifs [
<xref ref-type="bibr" rid="CR40">40</xref>
]. However, those scores are inherently rank based and miss other important aspects of binding activity such as the relative binding affinity between different binding sites [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
]. Therefore PWMs, and other motifs, obtained simply by maximizing AUPRC or AUROC scores should not be used as predictors of relative binding affinity. To do that they should be rescaled by reference to some external binding data, preferably from quantitative in vitro experiments. Alternatively, one can assume that the majority of peaks contain binding sites within some constrained range of binding affinity, perhaps within 100-fold of the maximum, and use that assumption to scale the PWM to approximate binding energies [
<xref ref-type="bibr" rid="CR20">20</xref>
].</p>
</sec>
<sec id="Sec13">
<title>Conclusions</title>
<p id="Par22">Addressing the issue of whether matrix models, which assume independent contributions across the positions of the binding site, are adequate representations of specificity requires appropriate comparisons. Comparing complex models that have been optimized for a specific task, such as maximizing AUROC, with PFM/PWM models that have been obtained from other types of data or for other tasks confounds the type of model with the method for obtaining the model parameters. We show that simple PWM models, when optimized for maximum AUROC, perform nearly as well as more complex non-linear models. We also show the advantages of PWMs over PFMs, and that including adjacent dinucleotides in the additive PWM model can further enhance its performance on at least some of the datasets. While DNA structure certainly contributes to binding affinity, at least in some cases, we advocate for finding mathematically optimal models that are simple and efficient but agnostic as to mechanism, and then inferring the mechanisms that contribute to binding affinity as further steps in the analysis.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Additional files</title>
<sec id="Sec14">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="12859_2018_2104_MOESM1_ESM.docx">
<label>Additional file 1:</label>
<caption>
<p>
<bold>Table S1.</bold>
Mean AUROC (and standard deviation) on ChIP-seq data. (DOCX 13 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM2">
<media xlink:href="12859_2018_2104_MOESM2_ESM.docx">
<label>Additional file 2:</label>
<caption>
<p>
<bold>Table S2.</bold>
Effect of Method for Generating Negative Sequences on Training and Testing Scores. (DOCX 13 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM3">
<media xlink:href="12859_2018_2104_MOESM3_ESM.docx">
<label>Additional file 3:</label>
<caption>
<p>
<bold>Table S3.</bold>
Scores for the motif optimization algorithms on ChIP-seq data with small training sets. (DOCX 12 kb)</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM4">
<media xlink:href="12859_2018_2104_MOESM4_ESM.xlsx">
<label>Additional file 4:</label>
<caption>
<p>
<bold>Table S4.</bold>
AUPRC and AUROC differences between model pairs by TF. (XLSX 32 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Electronic supplementary material</bold>
</p>
<p>The online version of this article (10.1186/s12859-018-2104-7) contains supplementary material, which is available to authorized users.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors are grateful to members of the Stormo lab for helpful comments and suggestions and to the anonymous reviewers for providing comments that further improved the manuscript.</p>
<sec id="FPar1">
<title>Funding</title>
<p id="Par23">This work has been supported by the National Institutes of Health [grant numbers HG000249, T32 HG000045].</p>
</sec>
<sec id="FPar2">
<title>Availability of data and materials</title>
<p id="Par24">Project name: Discriminative Additive Model Optimization (DAMO).</p>
<p id="Par25">Project home page:
<ext-link ext-link-type="uri" xlink:href="https://github.com/sx-ruan/DAMO">https://github.com/sx-ruan/DAMO</ext-link>
</p>
<p id="Par26">Operating system(s): Platform independent.</p>
<p id="Par27">Programming language: Python 2.7.</p>
<p id="Par28">License: GNU GPL v3.0.</p>
<p id="Par29">Any restrictions to use by non-academics: license needed.</p>
</sec>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>GDS conceived the research project. SR developed the software and performed the experiments. GDS and SR analyzed and interpreted the data. GDS and SR wrote, reviewed and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="FPar3">
<title>Ethics approval and consent to participate</title>
<p id="Par30">Not applicable.</p>
</sec>
<sec id="FPar4">
<title>Consent for publication</title>
<p id="Par31">Not applicable.</p>
</sec>
<sec id="FPar5">
<title>Competing interests</title>
<p id="Par32">The authors declare that they have no competing interests.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buratowski</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>The basics of basal transcription by RNA polymerase II</article-title>
<source>Cell</source>
<year>1994</year>
<volume>77</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>3</lpage>
<pub-id pub-id-type="doi">10.1016/0092-8674(94)90226-7</pub-id>
<pub-id pub-id-type="pmid">8156586</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McGhee</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Felsenfeld</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Nucleosome structure</article-title>
<source>Annu Rev Biochem</source>
<year>1980</year>
<volume>49</volume>
<fpage>1115</fpage>
<lpage>1156</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.bi.49.070180.005343</pub-id>
<pub-id pub-id-type="pmid">6996562</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jones</surname>
<given-names>PA</given-names>
</name>
</person-group>
<article-title>Functions of DNA methylation: islands, start sites, gene bodies and beyond</article-title>
<source>Nat Rev Genet</source>
<year>2012</year>
<volume>13</volume>
<issue>7</issue>
<fpage>484</fpage>
<lpage>492</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3230</pub-id>
<pub-id pub-id-type="pmid">22641018</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Determining the specificity of protein-DNA interactions</article-title>
<source>Nat Rev Genet</source>
<year>2010</year>
<volume>11</volume>
<issue>11</issue>
<fpage>751</fpage>
<lpage>760</lpage>
<pub-id pub-id-type="doi">10.1038/nrg2845</pub-id>
<pub-id pub-id-type="pmid">20877328</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pabo</surname>
<given-names>CO</given-names>
</name>
<name>
<surname>Sauer</surname>
<given-names>RT</given-names>
</name>
</person-group>
<article-title>Transcription factors: structural families and principles of DNA recognition</article-title>
<source>Annu Rev Biochem</source>
<year>1992</year>
<volume>61</volume>
<fpage>1053</fpage>
<lpage>1095</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.bi.61.070192.005201</pub-id>
<pub-id pub-id-type="pmid">1497306</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Boer</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>TR</given-names>
</name>
</person-group>
<article-title>YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>40</volume>
<issue>Database issue</issue>
<fpage>D169</fpage>
<lpage>D179</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkr993</pub-id>
<pub-id pub-id-type="pmid">22102575</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rhee</surname>
<given-names>DY</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>DY</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Slattery</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Mintseris</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>CY</given-names>
</name>
<name>
<surname>White</surname>
<given-names>KP</given-names>
</name>
<name>
<surname>Celniker</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Przytycka</surname>
<given-names>TM</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Transcription factor networks in Drosophila melanogaster</article-title>
<source>Cell Rep</source>
<year>2014</year>
<volume>8</volume>
<issue>6</issue>
<fpage>2031</fpage>
<lpage>2043</lpage>
<pub-id pub-id-type="doi">10.1016/j.celrep.2014.08.038</pub-id>
<pub-id pub-id-type="pmid">25242320</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vaquerizas</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Kummerfeld</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Teichmann</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Luscombe</surname>
<given-names>NM</given-names>
</name>
</person-group>
<article-title>A census of human transcription factors: function, expression and evolution</article-title>
<source>Nat Rev Genet</source>
<year>2009</year>
<volume>10</volume>
<issue>4</issue>
<fpage>252</fpage>
<lpage>263</lpage>
<pub-id pub-id-type="doi">10.1038/nrg2538</pub-id>
<pub-id pub-id-type="pmid">19274049</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mertin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>McDowall</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Harley</surname>
<given-names>VR</given-names>
</name>
</person-group>
<article-title>The DNA-binding specificity of SOX9 and other SOX proteins</article-title>
<source>Nucleic Acids Res</source>
<year>1999</year>
<volume>27</volume>
<issue>5</issue>
<fpage>1359</fpage>
<lpage>1364</lpage>
<pub-id pub-id-type="doi">10.1093/nar/27.5.1359</pub-id>
<pub-id pub-id-type="pmid">9973626</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kouzarides</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ziff</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Leucine zippers of fos, Jun and GCN4 dictate dimerization specificity and thereby control DNA binding</article-title>
<source>Nature</source>
<year>1989</year>
<volume>340</volume>
<issue>6234</issue>
<fpage>568</fpage>
<lpage>571</lpage>
<pub-id pub-id-type="doi">10.1038/340568a0</pub-id>
<pub-id pub-id-type="pmid">2505081</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hai</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Curran</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Cross-family dimerization of transcription factors Fos/Jun and ATF/CREB alters DNA binding specificity</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>1991</year>
<volume>88</volume>
<issue>9</issue>
<fpage>3720</fpage>
<lpage>3724</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.88.9.3720</pub-id>
<pub-id pub-id-type="pmid">1827203</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Al-Sarraj</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Day</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Thiel</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Specificity of transcriptional regulation by the zinc finger transcription factors Sp1, Sp3, and Egr-1</article-title>
<source>J Cell Biochem</source>
<year>2005</year>
<volume>94</volume>
<issue>1</issue>
<fpage>153</fpage>
<lpage>167</lpage>
<pub-id pub-id-type="doi">10.1002/jcb.20305</pub-id>
<pub-id pub-id-type="pmid">15523672</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weirauch</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Albu</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Cote</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Montenegro-Montero</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Drewe</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Najafabadi</surname>
<given-names>HS</given-names>
</name>
<name>
<surname>Lambert</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Cook</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Determination and inference of eukaryotic transcription factor sequence specificity</article-title>
<source>Cell</source>
<year>2014</year>
<volume>158</volume>
<issue>6</issue>
<fpage>1431</fpage>
<lpage>1443</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2014.08.009</pub-id>
<pub-id pub-id-type="pmid">25215497</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jolma</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kivioja</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Toivonen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Enge</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Taipale</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vaquerizas</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sillanpaa</surname>
<given-names>MJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities</article-title>
<source>Genome Res</source>
<year>2010</year>
<volume>20</volume>
<issue>6</issue>
<fpage>861</fpage>
<lpage>873</lpage>
<pub-id pub-id-type="doi">10.1101/gr.100552.109</pub-id>
<pub-id pub-id-type="pmid">20378718</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Berger</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Philippakis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Qureshi</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>He</surname>
<given-names>FS</given-names>
</name>
<name>
<surname>Estep</surname>
<given-names>PW</given-names>
<suffix>3rd</suffix>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
</person-group>
<article-title>Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities</article-title>
<source>Nat Biotechnol</source>
<year>2006</year>
<volume>24</volume>
<issue>11</issue>
<fpage>1429</fpage>
<lpage>1435</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1246</pub-id>
<pub-id pub-id-type="pmid">16998473</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Valouev</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Sundquist</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Medina</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Anton</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Sidow</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data</article-title>
<source>Nat Methods</source>
<year>2008</year>
<volume>5</volume>
<issue>9</issue>
<fpage>829</fpage>
<lpage>834</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1246</pub-id>
<pub-id pub-id-type="pmid">19160518</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Martini</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Rube</surname>
<given-names>HT</given-names>
</name>
<name>
<surname>Kribelbauer</surname>
<given-names>JF</given-names>
</name>
<name>
<surname>Rastogi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>FitzPatrick</surname>
<given-names>VD</given-names>
</name>
<name>
<surname>Houtman</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Pufall</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site</article-title>
<source>Genome Res</source>
<year>2018</year>
<volume>28</volume>
<issue>1</issue>
<fpage>111</fpage>
<lpage>121</lpage>
<pub-id pub-id-type="doi">10.1101/gr.222844.117</pub-id>
<pub-id pub-id-type="pmid">29196557</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Modeling the specificity of protein-DNA interactions</article-title>
<source>Quant Biol</source>
<year>2013</year>
<volume>1</volume>
<issue>2</issue>
<fpage>115</fpage>
<lpage>130</lpage>
<pub-id pub-id-type="doi">10.1007/s40484-013-0012-4</pub-id>
<pub-id pub-id-type="pmid">25045190</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Foat</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Morozov</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
</person-group>
<article-title>Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>14</issue>
<fpage>e141</fpage>
<lpage>e149</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl223</pub-id>
<pub-id pub-id-type="pmid">16873464</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ruan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Swamidass</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>BEESEM: estimation of binding energy models using HT-SELEX data</article-title>
<source>Bioinformatics</source>
<year>2017</year>
<volume>33</volume>
<issue>15</issue>
<fpage>2288</fpage>
<lpage>2295</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btx191</pub-id>
<pub-id pub-id-type="pmid">28379348</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Gold</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Quantitative analysis of the relationship between nucleotide sequence and functional activity</article-title>
<source>Nucleic Acids Res</source>
<year>1986</year>
<volume>14</volume>
<issue>16</issue>
<fpage>6661</fpage>
<lpage>6679</lpage>
<pub-id pub-id-type="doi">10.1093/nar/14.16.6661</pub-id>
<pub-id pub-id-type="pmid">3092188</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weirauch</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Cote</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Norel</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Annala</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Riley</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Saez-Rodriguez</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cokelaer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Vedenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Talukder</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Evaluation of methods for modeling transcription factor sequence specificity</article-title>
<source>Nat Biotechnol</source>
<year>2013</year>
<volume>31</volume>
<issue>2</issue>
<fpage>126</fpage>
<lpage>134</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2486</pub-id>
<pub-id pub-id-type="pmid">23354101</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benos</surname>
<given-names>PV</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Additivity in protein-DNA interactions: how good an approximation is it?</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>20</issue>
<fpage>4442</fpage>
<lpage>4451</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkf578</pub-id>
<pub-id pub-id-type="pmid">12384591</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Quantitative analysis demonstrates most transcription factors require only simple models of specificity</article-title>
<source>Nat Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<issue>6</issue>
<fpage>480</fpage>
<lpage>483</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1893</pub-id>
<pub-id pub-id-type="pmid">21654662</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<mixed-citation publication-type="other">Agius P, Arvey A, Chang W, Noble WS, Leslie C. High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions. PLoS Comput Biol. 2010;6(9)</mixed-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
</person-group>
<article-title>Bayesian hierarchical model of protein-binding microarray k-mer data reduces noise and identifies transcription factor subclasses and preferred k-mers</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>11</issue>
<fpage>1390</fpage>
<lpage>1398</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt152</pub-id>
<pub-id pub-id-type="pmid">23559638</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pandey</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Improved models for transcription factor binding site identification using nonindependent interactions</article-title>
<source>Genetics</source>
<year>2012</year>
<volume>191</volume>
<issue>3</issue>
<fpage>781</fpage>
<lpage>790</lpage>
<pub-id pub-id-type="doi">10.1534/genetics.112.138685</pub-id>
<pub-id pub-id-type="pmid">22505627</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abe</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Dror</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Slattery</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>RS</given-names>
</name>
</person-group>
<article-title>Deconvolving the recognition of DNA shape from sequence</article-title>
<source>Cell</source>
<year>2015</year>
<volume>161</volume>
<issue>2</issue>
<fpage>307</fpage>
<lpage>318</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2015.02.008</pub-id>
<pub-id pub-id-type="pmid">25843630</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>X</given-names>
</name>
<name>
<surname>West</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Joshi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Honig</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>RS</given-names>
</name>
</person-group>
<article-title>Origins of specificity in protein-DNA recognition</article-title>
<source>Annu Rev Biochem</source>
<year>2010</year>
<volume>79</volume>
<fpage>233</fpage>
<lpage>269</lpage>
<pub-id pub-id-type="doi">10.1146/annurev-biochem-060408-091030</pub-id>
<pub-id pub-id-type="pmid">20334529</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
<name>
<surname>West</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Sosinsky</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Honig</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>The role of DNA shape in protein-DNA recognition</article-title>
<source>Nature</source>
<year>2009</year>
<volume>461</volume>
<issue>7268</issue>
<fpage>1248</fpage>
<lpage>1253</lpage>
<pub-id pub-id-type="doi">10.1038/nature08473</pub-id>
<pub-id pub-id-type="pmid">19865164</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Abe</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Horton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Gordan</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Quantitative modeling of transcription factor binding specificities using DNA shape</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2015</year>
<volume>112</volume>
<issue>15</issue>
<fpage>4654</fpage>
<lpage>4659</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1422023112</pub-id>
<pub-id pub-id-type="pmid">25775564</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Dror</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Dantas Machado</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Ghane</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Di Felice</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<issue>Web Server issue</issue>
<fpage>W56</fpage>
<lpage>W62</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt437</pub-id>
<pub-id pub-id-type="pmid">23703209</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chiu</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Main</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Parker</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Nuzhdin</surname>
<given-names>SV</given-names>
</name>
<name>
<surname>Tullius</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>GBshape: a genome browser database for DNA shape annotations</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<issue>Database issue</issue>
<fpage>D103</fpage>
<lpage>D109</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku977</pub-id>
<pub-id pub-id-type="pmid">25326329</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mathelier</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Xin</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Chiu</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Rohs</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>DNA shape features improve transcription factor binding site predictions in vivo</article-title>
<source>Cell Syst</source>
<year>2016</year>
<volume>3</volume>
<issue>3</issue>
<fpage>278</fpage>
<lpage>286</lpage>
<pub-id pub-id-type="doi">10.1016/j.cels.2016.07.001</pub-id>
<pub-id pub-id-type="pmid">27546793</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Patel</surname>
<given-names>RY</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Discriminative motif optimization based on perceptron training</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>7</issue>
<fpage>941</fpage>
<lpage>948</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt748</pub-id>
<pub-id pub-id-type="pmid">24369152</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ruan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Inherent limitations of probabilistic models for protein-DNA binding specificity</article-title>
<source>PLoS Comput Biol</source>
<year>2017</year>
<volume>13</volume>
<issue>7</issue>
<fpage>e1005638</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1005638</pub-id>
<pub-id pub-id-type="pmid">28686588</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Consortium</surname>
<given-names>EP</given-names>
</name>
</person-group>
<article-title>An integrated encyclopedia of DNA elements in the human genome</article-title>
<source>Nature</source>
<year>2012</year>
<volume>489</volume>
<issue>7414</issue>
<fpage>57</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="doi">10.1038/nature11247</pub-id>
<pub-id pub-id-type="pmid">22955616</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Sugnet</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<name>
<surname>Roskin</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Pringle</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Zahler</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>The human genome browser at UCSC</article-title>
<source>Genome Res</source>
<year>2002</year>
<volume>12</volume>
<issue>6</issue>
<fpage>996</fpage>
<lpage>1006</lpage>
<pub-id pub-id-type="doi">10.1101/gr.229102</pub-id>
<pub-id pub-id-type="pmid">12045153</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Spiro</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Bazett-Jones</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>McMurray</surname>
<given-names>CT</given-names>
</name>
</person-group>
<article-title>DNA structure determines protein binding and transcriptional efficiency of the proenkephalin cAMP-responsive enhancer</article-title>
<source>J Biol Chem</source>
<year>1995</year>
<volume>270</volume>
<issue>46</issue>
<fpage>27702</fpage>
<lpage>27710</lpage>
<pub-id pub-id-type="doi">10.1074/jbc.270.46.27702</pub-id>
<pub-id pub-id-type="pmid">7499237</pub-id>
</element-citation>
</ref>
<ref id="CR40">
<label>40.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Orenstein</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Shamir</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>A comparative analysis of transcription factor binding models learned from PBM, HT-SELEX and ChIP data</article-title>
<source>Nucleic Acids Res</source>
<year>2014</year>
<volume>42</volume>
<issue>8</issue>
<fpage>e63</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gku117</pub-id>
<pub-id pub-id-type="pmid">24500199</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000271 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000271 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5840810
   |texte=   Comparison of discriminative motif optimization using matrix and DNA shape-based models
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:29510689" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021