MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Prediction of fine-tuned promoter activity from DNA sequence

Internal identifier: 000897 (Pmc/Corpus); previous: 000896; next: 000898

Authors: Geoffrey Siwo; Andrew Rider; Asako Tan; Richard Pinapati; Scott Emrich; Nitesh Chawla; Michael Ferdig

Source:

RBID : PMC:4916984

Abstract

The quantitative prediction of transcriptional activity of genes from promoter sequence is fundamental to the engineering of biological systems for industrial purposes and to understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved best-performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. In a test set that was hidden from participants, our method predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) more accurately than that of 33 laboratory-mutated variants of the promoters (r = 0.57). Notably, our model differed substantially from the other submissions in two main ways: i) it did not explicitly utilize transcription factor binding information, implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was based entirely on features extracted from the 100 bp region upstream from the translational start site, demonstrating that this region encodes much of the overall promoter activity. The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring biological systems.


Url:
DOI: 10.12688/f1000research.7485.1
PubMed: 27347373
PubMed Central: 4916984

Links to Exploration step

PMC:4916984

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Prediction of fine-tuned promoter activity from DNA sequence</title>
<author>
<name sortKey="Siwo, Geoffrey" sort="Siwo, Geoffrey" uniqKey="Siwo G" first="Geoffrey" last="Siwo">Geoffrey Siwo</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a5">IBM TJ Watson Research Center, NY, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a6">IBM Research-Africa, Johannesburg, South Africa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rider, Andrew" sort="Rider, Andrew" uniqKey="Rider A" first="Andrew" last="Rider">Andrew Rider</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tan, Asako" sort="Tan, Asako" uniqKey="Tan A" first="Asako" last="Tan">Asako Tan</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a7">Epicentre, Madison, WI, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pinapati, Richard" sort="Pinapati, Richard" uniqKey="Pinapati R" first="Richard" last="Pinapati">Richard Pinapati</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Emrich, Scott" sort="Emrich, Scott" uniqKey="Emrich S" first="Scott" last="Emrich">Scott Emrich</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chawla, Nitesh" sort="Chawla, Nitesh" uniqKey="Chawla N" first="Nitesh" last="Chawla">Nitesh Chawla</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ferdig, Michael" sort="Ferdig, Michael" uniqKey="Ferdig M" first="Michael" last="Ferdig">Michael Ferdig</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27347373</idno>
<idno type="pmc">4916984</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4916984</idno>
<idno type="RBID">PMC:4916984</idno>
<idno type="doi">10.12688/f1000research.7485.1</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000897</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000897</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Prediction of fine-tuned promoter activity from DNA sequence</title>
<author>
<name sortKey="Siwo, Geoffrey" sort="Siwo, Geoffrey" uniqKey="Siwo G" first="Geoffrey" last="Siwo">Geoffrey Siwo</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a5">IBM TJ Watson Research Center, NY, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a6">IBM Research-Africa, Johannesburg, South Africa</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rider, Andrew" sort="Rider, Andrew" uniqKey="Rider A" first="Andrew" last="Rider">Andrew Rider</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tan, Asako" sort="Tan, Asako" uniqKey="Tan A" first="Asako" last="Tan">Asako Tan</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a7">Epicentre, Madison, WI, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pinapati, Richard" sort="Pinapati, Richard" uniqKey="Pinapati R" first="Richard" last="Pinapati">Richard Pinapati</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Emrich, Scott" sort="Emrich, Scott" uniqKey="Emrich S" first="Scott" last="Emrich">Scott Emrich</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chawla, Nitesh" sort="Chawla, Nitesh" uniqKey="Chawla N" first="Nitesh" last="Chawla">Nitesh Chawla</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a3">Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ferdig, Michael" sort="Ferdig, Michael" uniqKey="Ferdig M" first="Michael" last="Ferdig">Michael Ferdig</name>
<affiliation>
<nlm:aff id="a1">Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a2">Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="a4">Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">F1000Research</title>
<idno type="eISSN">2046-1402</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>The quantitative prediction of transcriptional activity of genes from promoter sequence is fundamental to the engineering of biological systems for industrial purposes and to understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved best-performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. In a test set that was hidden from participants, our method predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) more accurately than that of 33 laboratory-mutated variants of the promoters (r = 0.57). Notably, our model differed substantially from the other submissions in two main ways: i) it did not explicitly utilize transcription factor binding information, implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was based entirely on features extracted from the 100 bp region upstream from the translational start site, demonstrating that this region encodes much of the overall promoter activity. The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring biological systems.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Schadt, Ee" uniqKey="Schadt E">EE Schadt</name>
</author>
<author>
<name sortKey="Monks, Sa" uniqKey="Monks S">SA Monks</name>
</author>
<author>
<name sortKey="Drake, Ta" uniqKey="Drake T">TA Drake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tirosh, I" uniqKey="Tirosh I">I Tirosh</name>
</author>
<author>
<name sortKey="Reikhav, S" uniqKey="Reikhav S">S Reikhav</name>
</author>
<author>
<name sortKey="Sigal, N" uniqKey="Sigal N">N Sigal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tirosh, I" uniqKey="Tirosh I">I Tirosh</name>
</author>
<author>
<name sortKey="Weinberger, A" uniqKey="Weinberger A">A Weinberger</name>
</author>
<author>
<name sortKey="Carmi, M" uniqKey="Carmi M">M Carmi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Field, Y" uniqKey="Field Y">Y Field</name>
</author>
<author>
<name sortKey="Fondufe Mittendorf, Y" uniqKey="Fondufe Mittendorf Y">Y Fondufe-Mittendorf</name>
</author>
<author>
<name sortKey="Moore, Ik" uniqKey="Moore I">IK Moore</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gonzales, Jm" uniqKey="Gonzales J">JM Gonzales</name>
</author>
<author>
<name sortKey="Patel, Jj" uniqKey="Patel J">JJ Patel</name>
</author>
<author>
<name sortKey="Ponmee, N" uniqKey="Ponmee N">N Ponmee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ellis, T" uniqKey="Ellis T">T Ellis</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Collins, Jj" uniqKey="Collins J">JJ Collins</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gertz, J" uniqKey="Gertz J">J Gertz</name>
</author>
<author>
<name sortKey="Cohen, Ba" uniqKey="Cohen B">BA Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gertz, J" uniqKey="Gertz J">J Gertz</name>
</author>
<author>
<name sortKey="Siggia, Ed" uniqKey="Siggia E">ED Siggia</name>
</author>
<author>
<name sortKey="Cohen, Ba" uniqKey="Cohen B">BA Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, Hd" uniqKey="Kim H">HD Kim</name>
</author>
<author>
<name sortKey="Shay, T" uniqKey="Shay T">T Shay</name>
</author>
<author>
<name sortKey="O Hea, Ek" uniqKey="O Hea E">EK O’Shea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Widom, J" uniqKey="Widom J">J Widom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Takahashi, K" uniqKey="Takahashi K">K Takahashi</name>
</author>
<author>
<name sortKey="Yamanaka, S" uniqKey="Yamanaka S">S Yamanaka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, Hd" uniqKey="Kim H">HD Kim</name>
</author>
<author>
<name sortKey="O Hea, Ek" uniqKey="O Hea E">EK O’Shea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Irie, T" uniqKey="Irie T">T Irie</name>
</author>
<author>
<name sortKey="Park, Sj" uniqKey="Park S">SJ Park</name>
</author>
<author>
<name sortKey="Yamashita, R" uniqKey="Yamashita R">R Yamashita</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cookson, W" uniqKey="Cookson W">W Cookson</name>
</author>
<author>
<name sortKey="Liang, L" uniqKey="Liang L">L Liang</name>
</author>
<author>
<name sortKey="Abecasis, G" uniqKey="Abecasis G">G Abecasis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karczewski, Kj" uniqKey="Karczewski K">KJ Karczewski</name>
</author>
<author>
<name sortKey="Tatonetti, Np" uniqKey="Tatonetti N">NP Tatonetti</name>
</author>
<author>
<name sortKey="Landt, Sg" uniqKey="Landt S">SG Landt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mjolsness, E" uniqKey="Mjolsness E">E Mjolsness</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Das, D" uniqKey="Das D">D Das</name>
</author>
<author>
<name sortKey="Banerjee, N" uniqKey="Banerjee N">N Banerjee</name>
</author>
<author>
<name sortKey="Zhang, Mq" uniqKey="Zhang M">MQ Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lam, Fh" uniqKey="Lam F">FH Lam</name>
</author>
<author>
<name sortKey="Steger, Dj" uniqKey="Steger D">DJ Steger</name>
</author>
<author>
<name sortKey="O Hea, Ek" uniqKey="O Hea E">EK O’Shea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mirny, La" uniqKey="Mirny L">LA Mirny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Xy" uniqKey="Li X">XY Li</name>
</author>
<author>
<name sortKey="Thomas, S" uniqKey="Thomas S">S Thomas</name>
</author>
<author>
<name sortKey="Sabo, Pj" uniqKey="Sabo P">PJ Sabo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Choi, Jk" uniqKey="Choi J">JK Choi</name>
</author>
<author>
<name sortKey="Kim, Yj" uniqKey="Kim Y">YJ Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lidor Nili, E" uniqKey="Lidor Nili E">E Lidor Nili</name>
</author>
<author>
<name sortKey="Field, Y" uniqKey="Field Y">Y Field</name>
</author>
<author>
<name sortKey="Lubling, Y" uniqKey="Lubling Y">Y Lubling</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raveh Sadka, T" uniqKey="Raveh Sadka T">T Raveh-Sadka</name>
</author>
<author>
<name sortKey="Levo, M" uniqKey="Levo M">M Levo</name>
</author>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Widom, J" uniqKey="Widom J">J Widom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kaplan, N" uniqKey="Kaplan N">N Kaplan</name>
</author>
<author>
<name sortKey="Moore, Ik" uniqKey="Moore I">IK Moore</name>
</author>
<author>
<name sortKey="Fondufe Mittendorf, Y" uniqKey="Fondufe Mittendorf Y">Y Fondufe-Mittendorf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Der Heijden, T" uniqKey="Van Der Heijden T">T van der Heijden</name>
</author>
<author>
<name sortKey="Van Vugt, Jj" uniqKey="Van Vugt J">JJ van Vugt</name>
</author>
<author>
<name sortKey="Logie, C" uniqKey="Logie C">C Logie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Widom, J" uniqKey="Widom J">J Widom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, Ck" uniqKey="Lee C">CK Lee</name>
</author>
<author>
<name sortKey="Shibata, Y" uniqKey="Shibata Y">Y Shibata</name>
</author>
<author>
<name sortKey="Rao, B" uniqKey="Rao B">B Rao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shivaswamy, S" uniqKey="Shivaswamy S">S Shivaswamy</name>
</author>
<author>
<name sortKey="Bhinge, A" uniqKey="Bhinge A">A Bhinge</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeevi, D" uniqKey="Zeevi D">D Zeevi</name>
</author>
<author>
<name sortKey="Sharon, E" uniqKey="Sharon E">E Sharon</name>
</author>
<author>
<name sortKey="Lotan Pompan, M" uniqKey="Lotan Pompan M">M Lotan-Pompan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, Yh" uniqKey="Yang Y">YH Yang</name>
</author>
<author>
<name sortKey="Dudoit, S" uniqKey="Dudoit S">S Dudoit</name>
</author>
<author>
<name sortKey="Luu, P" uniqKey="Luu P">P Luu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oshlack, A" uniqKey="Oshlack A">A Oshlack</name>
</author>
<author>
<name sortKey="Wakefield, Mj" uniqKey="Wakefield M">MJ Wakefield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kalir, S" uniqKey="Kalir S">S Kalir</name>
</author>
<author>
<name sortKey="Mcclure, J" uniqKey="Mcclure J">J McClure</name>
</author>
<author>
<name sortKey="Pabbaraju, K" uniqKey="Pabbaraju K">K Pabbaraju</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meyer, P" uniqKey="Meyer P">P Meyer</name>
</author>
<author>
<name sortKey="Siwo, G" uniqKey="Siwo G">G Siwo</name>
</author>
<author>
<name sortKey="Zeevi, D" uniqKey="Zeevi D">D Zeevi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brukner, I" uniqKey="Brukner I">I Brukner</name>
</author>
<author>
<name sortKey="Sanchez, R" uniqKey="Sanchez R">R Sánchez</name>
</author>
<author>
<name sortKey="Suck, D" uniqKey="Suck D">D Suck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Olson, Wk" uniqKey="Olson W">WK Olson</name>
</author>
<author>
<name sortKey="Gorin, Aa" uniqKey="Gorin A">AA Gorin</name>
</author>
<author>
<name sortKey="Lu, Xj" uniqKey="Lu X">XJ Lu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sivolob, Av" uniqKey="Sivolob A">AV Sivolob</name>
</author>
<author>
<name sortKey="Khrapunov, Sn" uniqKey="Khrapunov S">SN Khrapunov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raveh Sadka, T" uniqKey="Raveh Sadka T">T Raveh-Sadka</name>
</author>
<author>
<name sortKey="Levo, M" uniqKey="Levo M">M Levo</name>
</author>
<author>
<name sortKey="Shabi, U" uniqKey="Shabi U">U Shabi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lascaris, Rf" uniqKey="Lascaris R">RF Lascaris</name>
</author>
<author>
<name sortKey="Mager, Wh" uniqKey="Mager W">WH Mager</name>
</author>
<author>
<name sortKey="Planta, Rj" uniqKey="Planta R">RJ Planta</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Packer, Mj" uniqKey="Packer M">MJ Packer</name>
</author>
<author>
<name sortKey="Dauncey, Mp" uniqKey="Dauncey M">MP Dauncey</name>
</author>
<author>
<name sortKey="Hunter, Ca" uniqKey="Hunter C">CA Hunter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Laurens, N" uniqKey="Laurens N">N Laurens</name>
</author>
<author>
<name sortKey="Rusling, Da" uniqKey="Rusling D">DA Rusling</name>
</author>
<author>
<name sortKey="Pernstich, C" uniqKey="Pernstich C">C Pernstich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Starr, Db" uniqKey="Starr D">DB Starr</name>
</author>
<author>
<name sortKey="Hoopes, Bc" uniqKey="Hoopes B">BC Hoopes</name>
</author>
<author>
<name sortKey="Hawley, Dk" uniqKey="Hawley D">DK Hawley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vijayan, V" uniqKey="Vijayan V">V Vijayan</name>
</author>
<author>
<name sortKey="Zuzow, R" uniqKey="Zuzow R">R Zuzow</name>
</author>
<author>
<name sortKey="O Hea, Ek" uniqKey="O Hea E">EK O’Shea</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parvin, Jd" uniqKey="Parvin J">JD Parvin</name>
</author>
<author>
<name sortKey="Mccormick, Rj" uniqKey="Mccormick R">RJ McCormick</name>
</author>
<author>
<name sortKey="Sharp, Pa" uniqKey="Sharp P">PA Sharp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bosio, Mc" uniqKey="Bosio M">MC Bosio</name>
</author>
<author>
<name sortKey="Negri, R" uniqKey="Negri R">R Negri</name>
</author>
<author>
<name sortKey="Dieci, G" uniqKey="Dieci G">G Dieci</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yonetani, Y" uniqKey="Yonetani Y">Y Yonetani</name>
</author>
<author>
<name sortKey="Kono, H" uniqKey="Kono H">H Kono</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Vilardell, J" uniqKey="Vilardell J">J Vilardell</name>
</author>
<author>
<name sortKey="Warner, Jr" uniqKey="Warner J">JR Warner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deutschbauer, Am" uniqKey="Deutschbauer A">AM Deutschbauer</name>
</author>
<author>
<name sortKey="Jaramillo, Df" uniqKey="Jaramillo D">DF Jaramillo</name>
</author>
<author>
<name sortKey="Proctor, M" uniqKey="Proctor M">M Proctor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warner, Jr" uniqKey="Warner J">JR Warner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Spahn, Cm" uniqKey="Spahn C">CM Spahn</name>
</author>
<author>
<name sortKey="Beckmann, R" uniqKey="Beckmann R">R Beckmann</name>
</author>
<author>
<name sortKey="Eswar, N" uniqKey="Eswar N">N Eswar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ju, Q" uniqKey="Ju Q">Q Ju</name>
</author>
<author>
<name sortKey="Warner, Jr" uniqKey="Warner J">JR Warner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Causton, Hc" uniqKey="Causton H">HC Causton</name>
</author>
<author>
<name sortKey="Ren, B" uniqKey="Ren B">B Ren</name>
</author>
<author>
<name sortKey="Koh, Ss" uniqKey="Koh S">SS Koh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oinn, T" uniqKey="Oinn T">T Oinn</name>
</author>
<author>
<name sortKey="Addis, M" uniqKey="Addis M">M Addis</name>
</author>
<author>
<name sortKey="Ferris, J" uniqKey="Ferris J">J Ferris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Go I, Jr" uniqKey="Go I J">JR Goñi</name>
</author>
<author>
<name sortKey="Fenollosa, C" uniqKey="Fenollosa C">C Fenollosa</name>
</author>
<author>
<name sortKey="Perez, A" uniqKey="Perez A">A Pérez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Satchwell, Sc" uniqKey="Satchwell S">SC Satchwell</name>
</author>
<author>
<name sortKey="Drew, Hr" uniqKey="Drew H">HR Drew</name>
</author>
<author>
<name sortKey="Travers, Aa" uniqKey="Travers A">AA Travers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hall, M" uniqKey="Hall M">M Hall</name>
</author>
<author>
<name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
<author>
<name sortKey="Holmes, G" uniqKey="Holmes G">G Holmes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Siwo, G" uniqKey="Siwo G">G Siwo</name>
</author>
<author>
<name sortKey="Rider, A" uniqKey="Rider A">A Rider</name>
</author>
<author>
<name sortKey="Tan, A" uniqKey="Tan A">A Tan</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="methods-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">F1000Res</journal-id>
<journal-id journal-id-type="iso-abbrev">F1000Res</journal-id>
<journal-id journal-id-type="pmc">F1000Research</journal-id>
<journal-title-group>
<journal-title>F1000Research</journal-title>
</journal-title-group>
<issn pub-type="epub">2046-1402</issn>
<publisher>
<publisher-name>F1000Research</publisher-name>
<publisher-loc>London, UK</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27347373</article-id>
<article-id pub-id-type="pmc">4916984</article-id>
<article-id pub-id-type="doi">10.12688/f1000research.7485.1</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Method Article</subject>
</subj-group>
<subj-group>
<subject>Articles</subject>
<subj-group>
<subject>Bioinformatics</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Prediction of fine-tuned promoter activity from DNA sequence</article-title>
<fn-group content-type="pub-status">
<fn>
<p>[version 1; referees: 1 approved</p>
</fn>
</fn-group>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Siwo</surname>
<given-names>Geoffrey</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a4">4</xref>
<xref ref-type="aff" rid="a5">5</xref>
<xref ref-type="aff" rid="a6">6</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Rider</surname>
<given-names>Andrew</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a3">3</xref>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tan</surname>
<given-names>Asako</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a7">7</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pinapati</surname>
<given-names>Richard</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Emrich</surname>
<given-names>Scott</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a3">3</xref>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chawla</surname>
<given-names>Nitesh</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a3">3</xref>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ferdig</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<aff id="a1">
<label>1</label>
Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA</aff>
<aff id="a2">
<label>2</label>
Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA</aff>
<aff id="a3">
<label>3</label>
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA</aff>
<aff id="a4">
<label>4</label>
Interdisciplinary Center for Network Science and Applications (iCeNSA), University of Notre Dame, Notre Dame, IN, USA</aff>
<aff id="a5">
<label>5</label>
IBM TJ Watson Research Center, NY, USA</aff>
<aff id="a6">
<label>6</label>
IBM Research-Africa, Johannesburg, South Africa</aff>
<aff id="a7">
<label>7</label>
Epicentre, Madison, WI, USA</aff>
</contrib-group>
<author-notes>
<corresp id="c1">
<label>a</label>
<email xlink:href="mailto:siwomolbio@gmail.com">siwomolbio@gmail.com</email>
</corresp>
<fn fn-type="con">
<p>GHS, RSP, AT, AKR conceived the methods and performed the analysis. All authors wrote the manuscript.</p>
</fn>
<fn fn-type="COI-statement">
<p>
<bold>Competing interests: </bold>
The authors declare that they have no competing interests.</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>11</day>
<month>2</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>5</volume>
<elocation-id>158</elocation-id>
<history>
<date date-type="accepted">
<day>8</day>
<month>2</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright: © 2016 Siwo G et al.</copyright-statement>
<copyright-year>2016</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="f1000research-5-8064.pdf"></self-uri>
<abstract>
<p>The quantitative prediction of transcriptional activity of genes using promoter sequence is fundamental to the engineering of biological systems for industrial purposes and understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved the best performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. Our method accurately predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) as compared to 33 laboratory-mutated variants of the promoters (r = 0.57) in a test set that was hidden from participants. Notably, our model differed substantially from the rest in 2 main ways: i) it did not explicitly utilize transcription factor binding information implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was entirely based on features extracted exclusively from the 100 bp region upstream from the translational start site demonstrating that this region encodes much of the overall promoter activity. 
The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring biological systems.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Promoter activity</kwd>
<kwd>Gene expression</kwd>
<kwd>Expression prediction</kwd>
<kwd>DREAM challenges</kwd>
<kwd>Machine learning</kwd>
<kwd>Gene regulation</kwd>
<kwd>DNA sequence</kwd>
<kwd>Transcription modeling</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
</funding-group>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>Transcription is a fundamental step in the decoding of information encoded in DNA into phenotypes. Therefore, knowledge of transcriptional regulation is crucial for understanding the natural variation of gene expression
<sup>
<xref rid="ref-1" ref-type="bibr">1</xref>
–
<xref rid="ref-5" ref-type="bibr">5</xref>
</sup>
and for the accurate engineering of predictable gene expression systems
<sup>
<xref rid="ref-6" ref-type="bibr">6</xref>
–
<xref rid="ref-8" ref-type="bibr">8</xref>
</sup>
. While transcriptional regulation is one of the most highly studied areas in biology, the ability to quantitatively predict gene expression from DNA sequence remains inadequate
<sup>
<xref rid="ref-9" ref-type="bibr">9</xref>
,
<xref rid="ref-10" ref-type="bibr">10</xref>
</sup>
. Knowledge of transcription factors and their cognate binding sites continues to grow and has enhanced our ability to make qualitative predictions about gene expression. For example, a number of transcription factors are now well known to be involved in differentiation of stem cells into specific cell types, leading to potentially clinically useful applications such as induced pluripotent stem cells
<sup>
<xref rid="ref-11" ref-type="bibr">11</xref>
</sup>
. In spite of this progress, only limited quantitative predictions of gene expression are possible
<sup>
<xref rid="ref-6" ref-type="bibr">6</xref>
–
<xref rid="ref-8" ref-type="bibr">8</xref>
,
<xref rid="ref-12" ref-type="bibr">12</xref>
,
<xref rid="ref-13" ref-type="bibr">13</xref>
</sup>
. That promoter sequences encode both qualitative (e.g. when to switch a gene on and off) and quantitative properties (e.g. precise levels and noise) of gene expression is implied by the heritable nature of these attributes
<sup>
<xref rid="ref-1" ref-type="bibr">1</xref>
–
<xref rid="ref-3" ref-type="bibr">3</xref>
,
<xref rid="ref-14" ref-type="bibr">14</xref>
</sup>
. It is becoming increasingly clear that while transcription factors are critical in gene regulation, regulatory outputs are ultimately determined by co-operation between regulators in complex circuits
<sup>
<xref rid="ref-15" ref-type="bibr">15</xref>
–
<xref rid="ref-17" ref-type="bibr">17</xref>
</sup>
and with chromatin states
<sup>
<xref rid="ref-18" ref-type="bibr">18</xref>
–
<xref rid="ref-21" ref-type="bibr">21</xref>
</sup>
. In particular, transcription factors compete for DNA binding sites with nucleosomes
<sup>
<xref rid="ref-22" ref-type="bibr">22</xref>
,
<xref rid="ref-23" ref-type="bibr">23</xref>
</sup>
. The information for nucleosome binding is largely encoded in the DNA sequence
<sup>
<xref rid="ref-24" ref-type="bibr">24</xref>
–
<xref rid="ref-27" ref-type="bibr">27</xref>
</sup>
, even though
<italic>in vivo</italic>
nucleosome occupancy is highly dynamic
<sup>
<xref rid="ref-25" ref-type="bibr">25</xref>
,
<xref rid="ref-28" ref-type="bibr">28</xref>
,
<xref rid="ref-29" ref-type="bibr">29</xref>
</sup>
. Quantitative models of gene expression, therefore, benefit from the integration of nucleosome and transcription factor binding data
<sup>
<xref rid="ref-10" ref-type="bibr">10</xref>
,
<xref rid="ref-23" ref-type="bibr">23</xref>
,
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
.</p>
<p>A key barrier to quantitative modeling of gene expression using promoter sequence has been the lack of experimental methods for accurately measuring transcript levels. DNA microarrays and RNA-seq are the most widely used systems for measuring transcript abundance, but these measurements reflect many effects, including promoter sequence, the genomic position of a gene and post-transcriptional regulation of mRNA levels by processes such as mRNA degradation. In addition, microarrays and RNA-seq can be affected by systematic biases arising from sequence-dependent hybridization kinetics
<sup>
<xref rid="ref-31" ref-type="bibr">31</xref>
</sup>
and sequence-dependent read-depth coverage
<sup>
<xref rid="ref-32" ref-type="bibr">32</xref>
</sup>
, respectively. To overcome these limitations, approaches based on promoters fused to fluorescent reporters have been developed to generate direct, real-time measurement of promoter activity with high accuracy
<sup>
<xref rid="ref-33" ref-type="bibr">33</xref>
</sup>
. This has been applied in large libraries of synthetic bacterial promoters thereby generating new insights on combinatorial cis-regulation
<sup>
<xref rid="ref-8" ref-type="bibr">8</xref>
</sup>
. It was not until recently that the first large-scale library of naturally occurring promoters of any eukaryote fused to yellow fluorescent protein (YFP) became available
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
. In this library, 110 yeast ribosomal protein (RP) promoters were each fused to YFP and integrated at a fixed genomic location, each in a separate strain, hence alleviating both post-translational and genomic context-related effects
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
. Consequently, this data set is well suited to computational modeling of the relationship between promoter sequence and the transcriptional activity of eukaryotic promoters.</p>
<p>To provide a fair assessment of the relationship between promoter sequence and quantitative transcript levels, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) organized an open community challenge in 2011 (details of the challenge and an overview of the participating teams are provided in reference
<xref rid="ref-34" ref-type="bibr">34</xref>
), inviting participants to address this question using promoter activities from the RP promoter library, which had not yet been published
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
. Participants were provided with the activities of 90 promoters and their corresponding promoter sequences and challenged to predict the activity of 53 promoters whose activities were known only to the organizers of the challenge (
<xref ref-type="fig" rid="f1">Figure 1A</xref>
). After a period of three months, the challenge organizers independently assessed the performance of models from 21 teams using four different statistical tests. Our team, Fighting Irish Systems Team (FIrST), attained the best performance status on the basis of a combined score by the DREAM consortium in predicting the activities of these 53 promoters (Spearman correlation between predicted and actual activities r = 0.65,
<italic>P</italic>
= 0.002). Our approach was built upon the following key propositions: i) transcription factor binding and nucleosome binding, as well as other regulatory signals, are encoded in DNA
<sup>
<xref rid="ref-9" ref-type="bibr">9</xref>
,
<xref rid="ref-10" ref-type="bibr">10</xref>
,
<xref rid="ref-12" ref-type="bibr">12</xref>
,
<xref rid="ref-27" ref-type="bibr">27</xref>
</sup>
, ii) if i) is true, then explicit prior knowledge of transcription factor and nucleosome binding is not a mandatory prerequisite for prediction of promoter activity if training data is available. That is, an unbiased approach that explores the associations between DNA sequence patterns and promoter activity should be able to rediscover patterns that relate to the observed activity. To do this, we used machine learning methods to iteratively explore the association between promoter activity and DNA sequence patterns in 100 bp windows of promoter sequence. We considered sequence patterns such as k-mers (k = 1 to k = 5), homopolymer stretches, nucleosome binding and three mechanical properties of DNA (bendability
<sup>
<xref rid="ref-35" ref-type="bibr">35</xref>
</sup>
, deformability
<sup>
<xref rid="ref-36" ref-type="bibr">36</xref>
</sup>
and stiffness
<sup>
<xref rid="ref-37" ref-type="bibr">37</xref>
</sup>
). Based on iterative exploration of different machine learning models, we established that a support vector machine (SVM) was the most predictive of promoter activity based on specific sequence patterns in the 100 bp upstream of the translation start site (TrSS). Our model outperformed those which applied transcription factor binding sites of known RP promoters
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
, implying that other sequence patterns besides transcription factor binding sites can help in fine-tuning gene expression. Indeed, among the predictive features employed by our model were poly(dT-dA) tracts that occlude nucleosomes; these have since been applied to fine-tune gene expression beyond resolutions attainable by transcription factor site mutations
<sup>
<xref rid="ref-38" ref-type="bibr">38</xref>
</sup>
. Our study expands the understanding of sequence patterns that could potentially be useful in engineering fine-tuned gene expression.</p>
<fig fig-type="figure" id="f1" orientation="portrait" position="float">
<label>Figure 1. </label>
<caption>
<title>Summary of the DREAM6 gene expression challenge.</title>
<p>(
<bold>A</bold>
) Training data consisted of DNA sequences for 90 yeast RP promoters whose activities were experimentally determined
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
,
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
. DNA sequences were also provided for a blinded test set of 53 promoters, whose activities had likewise been determined experimentally but were withheld from the challenge participants. (
<bold>B</bold>
) Outline of the strategy for modeling promoter activity. Each promoter was segmented into 100 bp non-overlapping windows, with the full promoter regarded as a separate window. For each window, DNA sequence features were extracted and feature selection using a linear regression wrapper was performed prior to machine learning. The performance of machine learning models trained on each window was determined in 5- and 10-fold cross-validations using Pearson correlation.</p>
</caption>
<graphic xlink:href="f1000research-5-8064-g0000"></graphic>
</fig>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>DREAM6 challenge data</title>
<p>The training data, comprising DNA sequences for 90 yeast RP promoters with known activities, and a test data set of 53 promoters were downloaded from the DREAM challenge website (
<ext-link ext-link-type="uri" xlink:href="https://www.synapse.org//#!Synapse:syn2820426/wiki/71012">https://www.synapse.org//#!Synapse:syn2820426/wiki/71012</ext-link>
). Details of promoter construction are available from Zeevi
<italic>et al.</italic>
2011
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
and the DREAM website. Briefly, the organizers considered the promoter region as the sequence 1200 bp upstream of a gene or until the nearest gene. Each promoter was linked to a URA3 selection marker and inserted into the same fixed genomic location of a master yeast strain containing the
<italic>YFP</italic>
gene. In total, 110 natural RP promoter strains and 33 strains with synthetically mutated RP promoters were constructed. As a control for experimental variation, all these strains contained a control promoter (TEF2) driving the expression of red fluorescent protein (mCherry). The
<italic>mCherry</italic>
,
<italic>TEF2</italic>
,
<italic>URA3</italic>
,
<italic>RP</italic>
promoter and
<italic>YFP</italic>
formed a single contiguous DNA sequence, arranged in that order. Measurements of the
<italic>mCherry</italic>
expression levels and of replicate promoters showed very low variation, making it possible to distinguish any two promoters whose activities differed by as little as ~8%. Promoter activity was determined as the amount of YFP fluorescence produced during the exponential growth phase, divided by the integral of the optical density (OD) over the same period. This quantity measures the average amount of YFP produced from each promoter, per cell, per second during the exponential phase.</p>
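<p>As a rough illustration of the activity measure described above, the following Python sketch divides the YFP fluorescence gained over an exponential-phase window by a trapezoidal approximation of the integral of the OD over the same window. The function name, the sampling times and the toy readings are hypothetical, and taking the YFP produced as a simple end-minus-start difference is an assumption; the actual processing is described by Zeevi et al. and may differ.</p>
<preformat preformat-type="code"><![CDATA[
```python
def promoter_activity(times, yfp, od):
    """Approximate promoter activity over one exponential-phase window.

    times: measurement times in seconds, strictly increasing
    yfp:   YFP fluorescence readings at those times
    od:    optical density readings at those times
    """
    if not (len(times) == len(yfp) == len(od)) or len(times) < 2:
        raise ValueError("need at least two aligned measurements")
    # YFP produced during the window (assumption: end minus start)
    yfp_produced = yfp[-1] - yfp[0]
    # Trapezoidal approximation of the integral of OD over the window
    od_integral = 0.0
    for i in range(1, len(times)):
        od_integral += 0.5 * (od[i] + od[i - 1]) * (times[i] - times[i - 1])
    return yfp_produced / od_integral


# Hypothetical toy readings, not data from the study
times = [0.0, 600.0, 1200.0]
yfp = [100.0, 160.0, 250.0]
od = [0.10, 0.15, 0.20]
activity = promoter_activity(times, yfp, od)
```
]]></preformat>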
</sec>
<sec>
<title>Feature extraction</title>
<p>Each promoter sequence was divided into 100 bp non-overlapping windows. The full promoter sequence was considered as another window. To extract information from each of the windows, we considered the frequencies of specific k-mers (k = 1 to 5), the lengths of homopolymeric DNA stretches, mechanical properties (deformability, bendability and stiffness) and nucleosome binding. K-mer counts were performed using custom scripts. DNA mechanical properties were computed using workflows constructed in the Taverna Workbench version 2.2.0
<sup>
<xref rid="ref-53" ref-type="bibr">53</xref>
</sup>
and BioMoby web-services (accessed in August 2011) imported from the Molecular Modeling and Bioinformatics Group, Barcelona, Spain
<sup>
<xref rid="ref-54" ref-type="bibr">54</xref>
</sup>
. Bendability was estimated based on trinucleotide parameters obtained from DNase I digestion and nucleosome binding data
<sup>
<xref rid="ref-35" ref-type="bibr">35</xref>
</sup>
. Deformability was based on parameters from the analysis of protein-DNA crystallography structures
<sup>
<xref rid="ref-36" ref-type="bibr">36</xref>
</sup>
. Bending stiffness was based on bending free energy using the nearest-neighbor model
<sup>
<xref rid="ref-37" ref-type="bibr">37</xref>
</sup>
. Nucleosome binding was based on trinucleotide preferences
<sup>
<xref rid="ref-55" ref-type="bibr">55</xref>
</sup>
.</p>
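<p>The sequence-derived features described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' custom scripts: the function names are hypothetical, the tract feature is taken to be the longest uninterrupted run (an assumption, since the exact definition is not given here), and windows are assumed to be counted back from the 3' end of the promoter so that the first window abuts the TrSS.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer observed in seq (overlapping counts)."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def longest_tract(seq, unit):
    """Length in bp of the longest uninterrupted tandem run of `unit` in seq."""
    seq = seq.upper()
    best = 0
    for start in range(len(seq)):
        length = 0
        while seq.startswith(unit, start + length):
            length += len(unit)
        best = max(best, length)
    return best


def non_overlapping_windows(seq, size=100):
    """Windows of `size` bp counted back from the 3' end (assumed to abut the TrSS)."""
    out = []
    end = len(seq)
    while end > 0:
        out.append(seq[max(0, end - size):end])
        end -= size
    return out


# Hypothetical toy sequence, not a real RP promoter
toy = "GGTTTTTATATACCC"
g_freq = kmer_frequencies(toy, 1)["G"]
t_tract = longest_tract(toy, "T")
ta_tract = longest_tract(toy, "TA")
window_lengths = [len(w) for w in non_overlapping_windows("A" * 250)]
```
]]></preformat>
<p>In this sketch the last window returned by non_overlapping_windows may be shorter than 100 bp when the promoter length is not a multiple of the window size; how such remainders were handled in the study is not stated.</p>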
</sec>
<sec>
<title>Feature selection</title>
<p>For each window, feature selection was performed using a linear regression wrapper in the WEKA machine learning toolkit version 3.4
<sup>
<xref rid="ref-56" ref-type="bibr">56</xref>
</sup>
to select the feature combinations most predictive of promoter activity. The performance of feature combinations was tested using 5- and 10-fold cross-validation.</p>
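<p>To make the idea of a wrapper concrete, the sketch below scores candidate feature subsets by the cross-validated error of a linear regression fitted on them and grows the subset greedily. It is a pure-Python illustration under stated assumptions, not a reproduction of the WEKA procedure: the greedy forward search, the mean squared error criterion, the deterministic fold assignment and all function names are choices made here for brevity, and the toy data are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def cv_mse(x, y, cols, folds=5):
    """Mean squared error of linear regression on `cols` under k-fold cross-validation."""
    n = len(y)
    sq_err = 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        beta = fit_ols([[x[i][c] for c in cols] for i in train], [y[i] for i in train])
        for i in test:
            pred = beta[0] + sum(b * x[i][c] for b, c in zip(beta[1:], cols))
            sq_err += (pred - y[i]) ** 2
    return sq_err / n


def forward_select(x, y, folds=5, tol=1e-9):
    """Greedy forward wrapper: add the feature that most lowers CV error until none helps."""
    selected = []
    remaining = list(range(len(x[0])))
    best = cv_mse(x, y, selected, folds)
    while remaining:
        score, col = min((cv_mse(x, y, selected + [c], folds), c) for c in remaining)
        if best - score <= tol:
            break
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected


# Hypothetical toy data: activity depends on features 0 and 2 only
x = [[float(i), float((2 * i) % 5), float((i * i) % 7)] for i in range(20)]
y = [1.0 + 2.0 * r[0] + 3.0 * r[2] for r in x]
selected = forward_select(x, y)
```
]]></preformat>
<p>The sketch makes no attempt to handle collinear or constant features, which would make the normal equations singular; a production implementation would need regularization or a more robust solver.</p>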
</sec>
<sec>
<title>Machine learning model exploration</title>
<p>Three models implemented in the WEKA toolkit
<sup>
<xref rid="ref-56" ref-type="bibr">56</xref>
</sup>
were considered: SVM regression using sequential minimal optimization (SMO), linear regression and regression trees. Models were trained using 66% of the data and tested using 34%, and included only the features that were selected as important by the linear regression wrapper. Performance was determined using Pearson correlation between model predictions and actual promoter activities computed in R version 2.11.1. The SVM model was selected for refinement based on high performance compared to the other models.</p>
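<p>The evaluation step described above, a random 66%/34% split scored by Pearson correlation, can be illustrated as follows. This is a minimal Python sketch rather than the WEKA and R workflow used in the study; the helper names, the fixed seed and the rounding of the split size are assumptions made for the example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def split_indices(n, train_fraction=0.66, seed=0):
    """Randomly partition range(n) into training and testing index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


# 90 training promoters, as in the challenge data
train_idx, test_idx = split_indices(90)
```
]]></preformat>
<p>A model would be fitted on the promoters indexed by train_idx, and pearson would then be applied to its predictions and the measured activities for the promoters indexed by test_idx.</p>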
</sec>
<sec>
<title>Application of SVM model to DREAM6 test set</title>
<p>Promoter activities for the test set were not available to the participants of the challenge. We applied an ensemble of 501 SVMs: 500 built from different random training/test splits in which 80% of the data were used for training and 20% for testing, plus a single SVM validated with a 66% training and 34% testing split. Each SVM model utilized the 23 features selected by a linear regression wrapper as most predictive of promoter activity. To predict activities of the DREAM6 test set, these features were extracted from the upstream 100 bp sequence of each promoter. Predictions were then made using each of the SVM models and averaged to obtain the final predictions.</p>
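<p>The averaging step can be sketched as below. The three lambda functions are hypothetical stand-ins for fitted SVM regressors, used only so that the example is self-contained; in the study each member of the ensemble was an SVM trained on a different random subset of the training promoters.</p>
<preformat preformat-type="code"><![CDATA[
```python
def ensemble_predict(models, features):
    """Average the predictions of several fitted models for one promoter."""
    predictions = [model(features) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for fitted regressors: each maps a feature vector to an activity
models = [
    lambda f: 1.0 + 0.5 * f[0],
    lambda f: 0.8 + 0.6 * f[0],
    lambda f: 1.2 + 0.4 * f[0],
]
predicted = ensemble_predict(models, [4.0])
```
]]></preformat>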
</sec>
<sec>
<title>Validation of model by DREAM6 consortium</title>
<p>Predictions from the SVM ensemble were submitted through the DREAM website to the organizers for a blinded evaluation on the test set. The DREAM organizers used four statistics and corresponding
<italic>P</italic>
-values to evaluate the performance on the test set
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
. Details of the equations used for these statistics have been published separately by the DREAM6 Promoter Prediction Consortium
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Pearson correlation between predicted and observed activities for each model submitted: To generate a
<italic>P</italic>
-value for observing a Pearson correlation coefficient of the same magnitude or smaller than that of a given participant, a null distribution was generated by randomly sampling predictions from other teams and repeating this 10,000 times
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
</list-item>
<list-item>
<label>2.</label>
<p>Spearman correlation between the predicted and actual ranks of promoter activities for each participant: A
<italic>P</italic>
-value was then generated using a null distribution obtained from randomly sampling the predictions made by the other participants. The process was repeated 10,000 times
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
</list-item>
<list-item>
<label>3.</label>
<p>Chi-square distance metric measuring the distance between predicted and actual promoter activities: To generate a
<italic>P</italic>
-value for observing a chi-square distance metric of the same magnitude or smaller than that of a given model submission, a null distribution was generated by randomly sampling predictions from other teams and repeating this 10,000 times
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
</list-item>
<list-item>
<label>4.</label>
<p>A rank distance metric measuring the difference in ranks between predicted ranks and actual ranks of promoter activities. A
<italic>P</italic>
-value was generated from a null distribution obtained by randomly sampling predicted ranks from other teams, repeating this 10,000 times.</p>
</list-item>
</list>
<p>The overall score was defined as the product of the four
<italic>P</italic>
-values
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
. All these scores were computed using R version 2.11.1.</p>
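<p>The resampling idea behind these P-values can be illustrated for the Spearman statistic as follows. This is a sketch of one reading of the procedure, not the consortium's code: each null draw assigns every promoter a prediction taken from a randomly chosen team, and the P-value is taken here as the fraction of null draws that correlate with the actual activities at least as well as the observed submission. The direction of that comparison, the function names and the toy predictions are assumptions; the exact definitions are given in reference 34.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def empirical_p(observed, team_predictions, actual, draws=10000, seed=0):
    """Fraction of null draws scoring at least as high as `observed` (assumed tail)."""
    rng = random.Random(seed)
    n = len(actual)
    hits = 0
    for _ in range(draws):
        # For each promoter, borrow the prediction of a randomly chosen team
        sampled = [rng.choice(team_predictions)[i] for i in range(n)]
        if spearman(sampled, actual) >= observed:
            hits += 1
    return hits / draws


# Hypothetical toy example with three teams and five promoters
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
teams = [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 4.0, 3.0, 2.0, 1.0], [2.0, 1.0, 4.0, 3.0, 5.0]]
observed = spearman(teams[2], actual)
p_value = empirical_p(observed, teams, actual, draws=200)
```
]]></preformat>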
<table-wrap id="T1" orientation="portrait" position="anchor">
<label>Table 1. </label>
<caption>
<title>DNA sequence features predictive of promoter activity.</title>
</caption>
<table frame="hsides" rules="groups" content-type="article-table">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">DNA feature</th>
<th align="left" rowspan="1" colspan="1">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1" colspan="1">Mononucleotides</td>
<td rowspan="1" colspan="1">Frequency of G</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Dinucleotides</td>
<td rowspan="1" colspan="1">Frequency of GT</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Trinucleotides</td>
<td rowspan="1" colspan="1">Frequency of 6 trinucleotides</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Tetranucleotides</td>
<td rowspan="1" colspan="1">Frequency of 12 tetranucleotides</td>
</tr>
<tr>
<td rowspan="1" colspan="1">T-tracts</td>
<td rowspan="1" colspan="1">Length of T-tracts</td>
</tr>
<tr>
<td rowspan="1" colspan="1">TA-tracts</td>
<td rowspan="1" colspan="1">Length of TA-tracts</td>
</tr>
<tr>
<td rowspan="1" colspan="1">DNA deformability</td>
<td rowspan="1" colspan="1">Negatively correlated to activity</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<supplementary-material content-type="local-data" id="DS0">
<label>Raw data for ‘Prediction of fine-tuned promoter activity from DNA sequence’, Siwo
<italic>et al.</italic>
2016</label>
<caption>
<p>README.txt contains a description of the files.</p>
</caption>
<media xlink:href="f1000research-5-8064-s0000.tgz">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
<permissions>
<copyright-statement>Copyright: © 2016 Siwo G et al.</copyright-statement>
<copyright-year>2016</copyright-year>
<license xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">
<license-p>Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).</license-p>
</license>
</permissions>
</supplementary-material>
<sec>
<title>Promoter activity is highly predictable using the 100 bp upstream region from TrSS</title>
<p>The challenge organizers provided DNA sequences and promoter activities - the average rate of YFP production from each promoter, per cell per second, during the exponential phase - for 90 RP promoters (training set) and another set of 53 promoters whose activity was withheld from participants (test set)
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
. We first partitioned the promoter sequences into 100 bp non-overlapping windows, extracted specific DNA features from each window and considered the full promoter sequence as its own window (
<xref ref-type="fig" rid="f1">Figure 1B</xref>
). The features considered were k-mers (k = 1 to 5), length of homopolymeric stretches, nucleosome positioning and DNA mechanical properties (bendability, deformability and stiffness). For each window, we performed feature selection using a linear regression wrapper, then explored three different machine learning methods (SVM, linear regression and regression trees) to learn the association between features in the window and promoter activity (
<xref ref-type="fig" rid="f1">Figure 1B</xref>
). The performance in each window was assessed by Pearson correlation using 5- and 10-fold cross-validations on the training data. We observed very poor correlation (r ≪ 0.5) between predicted and actual promoter activities except when using the window comprising the 100 bp immediately upstream of the TrSS. Therefore, we focused the SVM model on this window using 23 features (
<xref ref-type="table" rid="T1">Table 1</xref>
) selected by the linear regression wrapper. A test of this model on 1000 randomized splits of the data (66% training and 34% testing sets) gave an average Pearson correlation of 0.78. The performance of machine learning models can be biased by the particular training/test split used. To reduce this bias, we obtained an additional 500 SVM models, each trained on a randomly sampled 80% of the data and validated on the remaining 20%. We then used the SVM models to make predictions for each promoter in the DREAM test set, whose activities were withheld from participants. For each promoter, the predicted activity was the average of the predictions across the whole ensemble of SVMs, each based only on the 100 bp upstream of the TrSS. These predicted activities were then submitted to the DREAM consortium for evaluation
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
<p>A total of 21 teams participated in the challenge (
<ext-link ext-link-type="uri" xlink:href="https://www.synapse.org//#!Synapse:syn2820426/wiki/71013">https://www.synapse.org//#!Synapse:syn2820426/wiki/71013</ext-link>
). Predictions from our team had a Spearman correlation of 0.65 (
<italic>P</italic>
= 0.002,
<xref ref-type="fig" rid="f2">Figure 2A</xref>
) to the actual activities, Pearson correlation of 0.65 (
<italic>P</italic>
= 0.003), chi-squared (
<italic>χ</italic>
<sup>2</sup>
) distance metric of 52.62 (
<italic>P</italic>
= 0.508) and
<italic>R</italic>
<sup>2</sup>
statistic measuring the difference in ranks between predicted and actual promoter activities of 35.85 (
<italic>P</italic>
= 0.004). The
<italic>P</italic>
-values were generated from the probability of obtaining a comparable or lower performance using a null distribution in which predictions were made by randomly choosing an activity for each promoter amongst all the 21 participating teams. A combined score based on the negative logarithm (base 10) of the geometric mean of the
<italic>P</italic>
-values for all the 4 scores ranked our team first
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
(
<xref ref-type="fig" rid="f2">Figure 2B</xref>
), with a significant
<italic>P</italic>
-value in three out of four of the statistical tests used for evaluation. Further, although we were not ranked first in the
<italic>χ</italic>
<sup>2</sup>
distance metric, our model performed the most consistently across the multiple assessment metrics, suggesting that the method is robust. A detailed comparison of the teams was published previously by the DREAM consortium
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
.</p>
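<p>As a worked illustration of the combined score described above, the sketch below computes the negative base-10 logarithm of the geometric mean of the four P-values reported for team FIrST. The function name is hypothetical, and because the inputs are the rounded P-values quoted in the text, the result (about 1.98) need not match the figure published by the consortium.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def combined_score(p_values):
    """Negative log10 of the geometric mean of a list of P-values."""
    log_sum = sum(math.log10(p) for p in p_values)
    return -log_sum / len(p_values)


# Rounded P-values quoted above: Spearman, Pearson, chi-square and rank metrics
p_values = [0.002, 0.003, 0.508, 0.004]
score = combined_score(p_values)
```
]]></preformat>
<p>Because the geometric mean is a monotonic function of the product of the P-values, ranking teams by this score gives the same order as ranking them by the product, with smaller products yielding higher scores.</p>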
<fig fig-type="figure" id="f2" orientation="portrait" position="float">
<label>Figure 2. </label>
<caption>
<title>Performance of the SVM model on the test set, as evaluated by the DREAM consortium.</title>
<p>(
<bold>A</bold>
) Correlation between the activity predicted by the SVM model and the actual activity of the 53 promoters whose activities were not available to participants. (
<bold>B</bold>
) Performance of team FIrST relative to the other 20 teams based on a combined score.</p>
</caption>
<graphic xlink:href="f1000research-5-8064-g0001"></graphic>
</fig>
</sec>
<sec>
<title>Biological significance of selected features</title>
<p>The final SVM models utilized only 23 features, consisting of the frequencies of the mononucleotide G, the dinucleotide GT, 6 different trinucleotides and 12 different tetranucleotides, the lengths of poly(dT) and poly(dT-dA) tracts, and the predicted deformability of the DNA (
<xref ref-type="table" rid="T1">Table 1</xref>
). The relative importance of these features, based on their weights in the SVM models, is provided (see Data availability). The feature with the highest weight was the frequency of the mononucleotide G, which correlated negatively with promoter activity. For many of these features there was no clear link to underlying mechanisms of gene regulation. However, some of the k-mers may be implicitly linked to transcription factor binding sites. That is, the combination of different k-mer features could capture the binding motifs of specific transcription factors. For example, the second most important feature in the SVM was the tetranucleotide ACCC, which also occurs in the
<italic>Rap1</italic>
binding site motif
<sup>
<xref rid="ref-39" ref-type="bibr">39</xref>
</sup>
. In addition, frequencies of different k-mers could impact the DNA mechanical structure
<sup>
<xref rid="ref-40" ref-type="bibr">40</xref>
</sup>
. Among the features identified by the SVM model were poly(dT) and poly(dT-dA) tracts which influence the rigidity of DNA
<sup>
<xref rid="ref-24" ref-type="bibr">24</xref>
,
<xref rid="ref-26" ref-type="bibr">26</xref>
</sup>
, thereby directly impacting nucleosome binding. Furthermore, insertion of poly(dT-dA) sequences into promoters can be used to regulate gene expression to a finer degree and at more gradual intervals than could be attained by transcription factor binding site mutations
<sup>
<xref rid="ref-38" ref-type="bibr">38</xref>
</sup>
. Some transcription factors are also highly dependent on the ability of DNA to bend
<sup>
<xref rid="ref-41" ref-type="bibr">41</xref>
–
<xref rid="ref-43" ref-type="bibr">43</xref>
</sup>
. In particular, TATA binding protein (TBP), which binds to the TATA box, is important for regulating the activity of RP promoters
<sup>
<xref rid="ref-42" ref-type="bibr">42</xref>
,
<xref rid="ref-44" ref-type="bibr">44</xref>
,
<xref rid="ref-45" ref-type="bibr">45</xref>
</sup>
. Another directly biologically relevant feature identified by the SVM was the deformability of DNA
<sup>
<xref rid="ref-36" ref-type="bibr">36</xref>
,
<xref rid="ref-46" ref-type="bibr">46</xref>
</sup>
. Promoters of low activity had more deformable DNA than those of high activity (
<xref ref-type="fig" rid="f3">Figure 3</xref>
,
<italic>P</italic>
= 0.008). This was particularly evident 40 to 60 bp upstream of the TrSS when comparing the top 20 promoters with the highest versus those with the lowest activity (
<xref ref-type="fig" rid="f3">Figure 3</xref>
).</p>
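<p>The sequence-composition features described above can be illustrated with a minimal sketch. It is not the authors' feature-extraction code: the function names are hypothetical, the k-mers shown (G, GT and ACCC) are only those named in the text, and defining a tract length as the longest uninterrupted run of the repeat unit is an assumption made for illustration.</p>
<preformat>
```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Return the relative frequency of every k-mer observed in seq."""
    seq = seq.upper()
    if k > len(seq):
        return {}
    n_windows = len(seq) - k + 1
    # Count overlapping windows of length k along the sequence.
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: n / n_windows for kmer, n in counts.items()}


def longest_tract(seq, unit):
    """Length in bp of the longest run of back-to-back copies of unit."""
    seq, unit = seq.upper(), unit.upper()
    if not unit:
        return 0
    best = 0
    for start in range(len(seq)):
        copies = 0
        # Extend the run while the next copy of the unit matches.
        while seq.startswith(unit, start + copies * len(unit)):
            copies += 1
        best = max(best, copies * len(unit))
    return best


def promoter_features(seq, kmers=("G", "GT", "ACCC")):
    """Toy feature vector: selected k-mer frequencies plus two tract lengths.

    The choice of k-mers and the tract definition are illustrative
    assumptions, not the feature set reported in Table 1.
    """
    features = {}
    for kmer in kmers:
        features["freq_" + kmer] = kmer_frequencies(seq, len(kmer)).get(kmer, 0.0)
    features["poly_T"] = longest_tract(seq, "T")
    features["poly_TA"] = longest_tract(seq, "TA")
    return features
```
</preformat>
<p>For instance, calling promoter_features on the made-up sequence TTTTACCCGTATATA reports a poly(dT) tract of 4 bp and a poly(dT-dA) tract of 6 bp. In practice the input would be the 100 bp region upstream of the TrSS.</p>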
<fig fig-type="figure" id="f3" orientation="portrait" position="float">
<label>Figure 3. </label>
<caption>
<title>Relationship between protein deformability of promoters and activity.</title>
<p>Among the top 20 promoters with extreme activities (high and low), significant deviation in deformability occurs at the -40 to -60 bp region from the TrSS (t-test
<italic>P</italic>
= 0.008).</p>
</caption>
<graphic xlink:href="f1000research-5-8064-g0002"></graphic>
</fig>
<p>Finally, some of the features may affect mRNA stability, especially given their potential location downstream of the transcription start sites (TSS). In addition, sequence features in the 5’UTR that are close to the TSS could affect transcription, translation and mRNA stability.</p>
</sec>
<sec>
<title>Error profile of SVM promoter activity model</title>
<p>Understanding the biases in prediction accuracy could provide biological insights into promoter classes and allow for refinement of models. Therefore, we investigated relationships between the nature of the test promoters and the magnitude of prediction error made by our model. Among the 53 test promoters provided by the DREAM challenge, 20 were natural yeast RP promoters while 33 were variants of these promoters with specific synthetic mutations introduced. These mutations included changes in the binding sites of the TBP,
<italic>Rap1, Fhl1</italic>
and
<italic>Sfp1</italic>
, as well as introduction of nucleosome disfavoring sequences and random mutations. At the time of the challenge, participants were not aware of these mutations. The performance of our model on the set of natural promoters was much higher (Spearman correlation r = 0.73,
<italic>P</italic>
= 0.0003) compared to that for the mutated promoters (Spearman correlation r = 0.57,
<italic>P</italic>
= 0.0005). The prediction error was significantly less for natural promoters versus the mutated promoters (Student’s t-test,
<italic>P</italic>
= 0.01,
<xref ref-type="fig" rid="f4">Figure 4A</xref>
). This could partly be due to the composition of the training set, which contained only natural promoters. Similarly poor performance was also observed in the models submitted by other teams
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
. In addition, most of the synthetic mutations were introduced at promoter locations outside the 100 bp region upstream of the TrSS and therefore could not be detected by our model. We also examined the correlation between the observed promoter activity and the prediction error. Promoters of low activity had larger prediction errors (Pearson correlation between promoter activity and prediction error r = -0.31,
<italic>P</italic>
= 0.02,
<xref ref-type="fig" rid="f4">Figure 4B</xref>
). Notably, natural promoters had slightly higher activity compared to synthetic promoters (
<italic>P</italic>
= 0.02), so the correlation between activity and prediction error may be a consequence of the low predictability of synthetic promoters. Thus, future models may benefit from training data on the activities of mutated promoters, which could enable more accurate modeling of the impact of mutations on specific transcription factor binding sites.</p>
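<p>The per-class comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used for the challenge: the function names are hypothetical, prediction error is assumed to be the absolute difference between observed and predicted activity, and both rank-based and linear correlations are reported so that either can be inspected.</p>
<preformat>
```python
from statistics import mean


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Advance j to the last index of the current block of tied values.
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def error_profile(records):
    """Summarize accuracy per promoter class.

    records: iterable of (promoter_class, observed, predicted) tuples.
    The absolute-difference error is an assumption for illustration.
    """
    by_class = {}
    for cls, observed, predicted in records:
        by_class.setdefault(cls, []).append((observed, predicted))
    profile = {}
    for cls, pairs in by_class.items():
        obs = [o for o, _ in pairs]
        pred = [p for _, p in pairs]
        profile[cls] = {
            "n": len(pairs),
            "spearman": spearman(obs, pred),
            "pearson": pearson(obs, pred),
            "mean_abs_error": mean(abs(o - p) for o, p in pairs),
        }
    return profile
```
</preformat>
<p>Given one record per test promoter labelled, for example, as natural or mutated, error_profile returns the number of promoters, the two correlations and the mean absolute error for each class, which is the kind of summary compared in this section.</p>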
<fig fig-type="figure" id="f4" orientation="portrait" position="float">
<label>Figure 4. </label>
<caption>
<title>Dependence of prediction error on promoter class or activity.</title>
<p>(
<bold>A</bold>
) Natural promoters had a lower prediction error compared to synthetically mutated promoters. (
<bold>B</bold>
) Prediction error is negatively correlated with promoter activity.</p>
</caption>
<graphic xlink:href="f1000research-5-8064-g0003"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>The quantitative modeling of gene expression has the potential to enhance our understanding of how gene regulation is fine-tuned in natural populations and has implications for the design of predictable gene expression systems. The DREAM6 challenge data set for promoter activity prediction was a unique opportunity to evaluate the predictability of a gene's expression from its promoter sequence. Given that all promoters were derived from natural yeast RP promoters that are expressed in the exponential phase
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
, the challenge was targeted towards DNA sequence patterns that fine-tune gene expression rather than those that simply determine the ‘on/off’ expression status. RP transcription regulation occurs in a highly coordinated manner and is critical for growth, allowing cells to adjust their protein synthesis capacity to physiological needs
<sup>
<xref rid="ref-47" ref-type="bibr">47</xref>
,
<xref rid="ref-48" ref-type="bibr">48</xref>
</sup>
. This is especially crucial as RP gene expression accounts for 50% of transcripts produced by RNA polymerase II
<sup>
<xref rid="ref-49" ref-type="bibr">49</xref>
</sup>
and dysregulation of RP genes leads to reduced fitness
<sup>
<xref rid="ref-47" ref-type="bibr">47</xref>
,
<xref rid="ref-48" ref-type="bibr">48</xref>
</sup>
. The yeast genome contains 137 RP genes, of which 19 encode a unique RP and the remaining 118 form 59 duplicated pairs. The proper functioning of ribosomes requires that all the ribosome components be expressed in equimolar concentrations
<sup>
<xref rid="ref-50" ref-type="bibr">50</xref>
</sup>
while simultaneously remaining responsive to physiological needs
<sup>
<xref rid="ref-51" ref-type="bibr">51</xref>
,
<xref rid="ref-52" ref-type="bibr">52</xref>
</sup>
. This is potentially challenging given the copy-number differences between the RP genes because high copy number genes generally show increased expression. The regulatory mechanisms underlying this fine-tuned regulation are not known. By accurately predicting the activity of the RP genes using the promoter sequences, we demonstrate that a considerable amount of this information is encoded in the DNA sequence.</p>
<p>It is intriguing that our model performed well although it did not explicitly use transcription factor binding site information and considered only the 100 bp region upstream of the TrSS. Some of the features identified by our model may influence transcription factor binding or nucleosomes indirectly, and could even affect mRNA translation. Transcription factors are critical for gene regulation. Their empirically identified binding sites are 6 to 8 bp, theoretically putting an upper bound on the level of regulatory flexibility that can be attained by mutating positions at these sites
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
,
<xref rid="ref-38" ref-type="bibr">38</xref>
</sup>
. Cooperation between transcription factors or competition among them
<sup>
<xref rid="ref-15" ref-type="bibr">15</xref>
–
<xref rid="ref-17" ref-type="bibr">17</xref>
</sup>
, and with nucleosomes
<sup>
<xref rid="ref-23" ref-type="bibr">23</xref>
</sup>
, provides an additional mechanism for fine-tuned gene expression. RP promoters with high activity have not only more nucleosome disfavoring sequences but also characteristic spatial organization of the binding sites for
<italic>Rap1</italic>
,
<italic>Sfp1</italic>
and
<italic>Fhl1</italic>
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
. The low performance of our model on synthetic promoters containing targeted mutations in transcription factor binding sites and nucleosome disfavoring sequences reinforces the importance of these factors. Consistent with this, the combination of our model and the mechanistically driven model involving transcription factors and nucleosome binding
<sup>
<xref rid="ref-30" ref-type="bibr">30</xref>
</sup>
was more predictive of promoter activity
<sup>
<xref rid="ref-34" ref-type="bibr">34</xref>
</sup>
. Our findings have implications for understanding the fine-tuned regulation of RP genes and engineering desirable activity in synthetic promoters.</p>
</sec>
<sec>
<title>Data availability</title>
<p>The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2016 Siwo G et al.</p>
<p>Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
</p>
<p>
<italic>F1000Research</italic>
: Dataset 1. Raw data for ‘Prediction of fine-tuned promoter activity from DNA sequence’, Siwo
<italic>et al.</italic>
2016,
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5256/f1000research.7485.d113516">10.5256/f1000research.7485.d113516</ext-link>
<sup>
<xref rid="ref-57" ref-type="bibr">57</xref>
</sup>
</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>This work would not have been possible without the pre-publication provision of data to the DREAM challenge by Dr. Eran Segal and his group at the Weizmann Institute of Science, Israel, and the curation of the challenge by the DREAM committee: Drs. Gustavo Stolovitzky, Pablo Meyer and Rachel Norel at IBM Research, USA. We are grateful to the DREAM6 Promoter Prediction Consortium for the rigorous evaluation of the models.</p>
</ack>
<ref-list>
<ref id="ref-1">
<label>1</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schadt</surname>
<given-names>EE</given-names>
</name>
<name>
<surname>Monks</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Drake</surname>
<given-names>TA</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Genetics of gene expression surveyed in maize, mouse and man.</article-title>
<source>
<italic>Nature.</italic>
</source>
<year>2003</year>
;
<volume>422</volume>
(
<issue>6929</issue>
):
<fpage>297</fpage>
<lpage>302</lpage>
.
<pub-id pub-id-type="doi">10.1038/nature01434</pub-id>
<pub-id pub-id-type="pmid">12646919</pub-id>
</mixed-citation>
</ref>
<ref id="ref-2">
<label>2</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tirosh</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Reikhav</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sigal</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Chromatin regulators as capacitors of interspecies variations in gene expression.</article-title>
<source>
<italic>Mol Syst Biol.</italic>
</source>
<year>2010</year>
;
<volume>6</volume>
:
<fpage>435</fpage>
.
<pub-id pub-id-type="doi">10.1038/msb.2010.84</pub-id>
<pmc-comment>3010112</pmc-comment>
<pub-id pub-id-type="pmid">21119629</pub-id>
</mixed-citation>
</ref>
<ref id="ref-3">
<label>3</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tirosh</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Carmi</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>A genetic signature of interspecies variations in gene expression.</article-title>
<source>
<italic>Nat Genet.</italic>
</source>
<year>2006</year>
;
<volume>38</volume>
(
<issue>7</issue>
):
<fpage>830</fpage>
<lpage>834</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng1819</pub-id>
<pub-id pub-id-type="pmid">16783381</pub-id>
</mixed-citation>
</ref>
<ref id="ref-4">
<label>4</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Field</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Fondufe-Mittendorf</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>IK</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Gene expression divergence in yeast is coupled to evolution of DNA-encoded nucleosome organization.</article-title>
<source>
<italic>Nat Genet.</italic>
</source>
<year>2009</year>
;
<volume>41</volume>
(
<issue>4</issue>
):
<fpage>438</fpage>
<lpage>445</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng.324</pub-id>
<pmc-comment>2744203</pmc-comment>
<pub-id pub-id-type="pmid">19252487</pub-id>
</mixed-citation>
</ref>
<ref id="ref-5">
<label>5</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gonzales</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Ponmee</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Regulatory hotspots in the malaria parasite genome dictate transcriptional variation.</article-title>
<source>
<italic>PLoS Biol.</italic>
</source>
<year>2008</year>
;
<volume>6</volume>
(
<issue>9</issue>
):
<fpage>e238</fpage>
.
<pub-id pub-id-type="doi">10.1371/journal.pbio.0060238</pub-id>
<pmc-comment>2553844</pmc-comment>
<pub-id pub-id-type="pmid">18828674</pub-id>
</mixed-citation>
</ref>
<ref id="ref-6">
<label>6</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ellis</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Collins</surname>
<given-names>JJ</given-names>
</name>
</person-group>
:
<article-title>Diversity-based, model-guided construction of synthetic gene networks with predicted functions.</article-title>
<source>
<italic>Nat Biotechnol.</italic>
</source>
<year>2009</year>
;
<volume>27</volume>
(
<issue>5</issue>
):
<fpage>465</fpage>
<lpage>471</lpage>
.
<pub-id pub-id-type="doi">10.1038/nbt.1536</pub-id>
<pmc-comment>2680460</pmc-comment>
<pub-id pub-id-type="pmid">19377462</pub-id>
</mixed-citation>
</ref>
<ref id="ref-7">
<label>7</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gertz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>BA</given-names>
</name>
</person-group>
:
<article-title>Environment-specific combinatorial
<italic>cis</italic>
-regulation in synthetic promoters.</article-title>
<source>
<italic>Mol Syst Biol.</italic>
</source>
<year>2009</year>
;
<volume>5</volume>
:
<fpage>244</fpage>
.
<pub-id pub-id-type="doi">10.1038/msb.2009.1</pub-id>
<pmc-comment>2657533</pmc-comment>
<pub-id pub-id-type="pmid">19225457</pub-id>
</mixed-citation>
</ref>
<ref id="ref-8">
<label>8</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gertz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Siggia</surname>
<given-names>ED</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>BA</given-names>
</name>
</person-group>
:
<article-title>Analysis of combinatorial
<italic>cis</italic>
-regulation in synthetic and genomic promoters.</article-title>
<source>
<italic>Nature.</italic>
</source>
<year>2009</year>
;
<volume>457</volume>
(
<issue>7226</issue>
):
<fpage>215</fpage>
<lpage>218</lpage>
.
<pub-id pub-id-type="doi">10.1038/nature07521</pub-id>
<pmc-comment>2677908</pmc-comment>
<pub-id pub-id-type="pmid">19029883</pub-id>
</mixed-citation>
</ref>
<ref id="ref-9">
<label>9</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>HD</given-names>
</name>
<name>
<surname>Shay</surname>
<given-names>T</given-names>
</name>
<name>
<surname>O’Shea</surname>
<given-names>EK</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Transcriptional regulatory circuits: predicting numbers from alphabets.</article-title>
<source>
<italic>Science.</italic>
</source>
<year>2009</year>
;
<volume>325</volume>
(
<issue>5939</issue>
):
<fpage>429</fpage>
<lpage>432</lpage>
.
<pmc-comment>2745280</pmc-comment>
<pub-id pub-id-type="pmid">19628860</pub-id>
</mixed-citation>
</ref>
<ref id="ref-10">
<label>10</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Widom</surname>
<given-names>J</given-names>
</name>
</person-group>
:
<article-title>From DNA sequence to transcriptional behaviour: a quantitative approach.</article-title>
<source>
<italic>Nat Rev Genet.</italic>
</source>
<year>2009</year>
;
<volume>10</volume>
(
<issue>7</issue>
):
<fpage>443</fpage>
<lpage>456</lpage>
.
<pub-id pub-id-type="doi">10.1038/nrg2591</pub-id>
<pmc-comment>2719885</pmc-comment>
<pub-id pub-id-type="pmid">19506578</pub-id>
</mixed-citation>
</ref>
<ref id="ref-11">
<label>11</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Takahashi</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Yamanaka</surname>
<given-names>S</given-names>
</name>
</person-group>
:
<article-title>Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors.</article-title>
<source>
<italic>Cell.</italic>
</source>
<year>2006</year>
;
<volume>126</volume>
(
<issue>4</issue>
):
<fpage>663</fpage>
<lpage>676</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.cell.2006.07.024</pub-id>
<pub-id pub-id-type="pmid">16904174</pub-id>
</mixed-citation>
</ref>
<ref id="ref-12">
<label>12</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>HD</given-names>
</name>
<name>
<surname>O’Shea</surname>
<given-names>EK</given-names>
</name>
</person-group>
:
<article-title>A quantitative model of transcription factor-activated gene expression.</article-title>
<source>
<italic>Nat Struct Mol Biol.</italic>
</source>
<year>2008</year>
;
<volume>15</volume>
(
<issue>11</issue>
):
<fpage>1192</fpage>
<lpage>1198</lpage>
.
<pub-id pub-id-type="doi">10.1038/nsmb.1500</pub-id>
<pmc-comment>2696132</pmc-comment>
<pub-id pub-id-type="pmid">18849996</pub-id>
</mixed-citation>
</ref>
<ref id="ref-13">
<label>13</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Irie</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Yamashita</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Predicting promoter activities of primary human DNA sequences.</article-title>
<source>
<italic>Nucleic Acids Res.</italic>
</source>
<year>2011</year>
;
<volume>39</volume>
(
<issue>11</issue>
):
<fpage>e75</fpage>
.
<pub-id pub-id-type="doi">10.1093/nar/gkr173</pub-id>
<pmc-comment>3113590</pmc-comment>
<pub-id pub-id-type="pmid">21486745</pub-id>
</mixed-citation>
</ref>
<ref id="ref-14">
<label>14</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cookson</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Abecasis</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Mapping complex disease traits with global gene expression.</article-title>
<source>
<italic>Nat Rev Genet.</italic>
</source>
<year>2009</year>
;
<volume>10</volume>
(
<issue>3</issue>
):
<fpage>184</fpage>
<lpage>194</lpage>
.
<pub-id pub-id-type="doi">10.1038/nrg2537</pub-id>
<pmc-comment>4550035</pmc-comment>
<pub-id pub-id-type="pmid">19223927</pub-id>
</mixed-citation>
</ref>
<ref id="ref-15">
<label>15</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karczewski</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Tatonetti</surname>
<given-names>NP</given-names>
</name>
<name>
<surname>Landt</surname>
<given-names>SG</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Cooperative transcription factor associations discovered using regulatory variation.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>2011</year>
;
<volume>108</volume>
(
<issue>32</issue>
):
<fpage>13353</fpage>
<lpage>13358</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.1103105108</pub-id>
<pmc-comment>3156166</pmc-comment>
<pub-id pub-id-type="pmid">21828005</pub-id>
</mixed-citation>
</ref>
<ref id="ref-16">
<label>16</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mjolsness</surname>
<given-names>E</given-names>
</name>
</person-group>
:
<article-title>On cooperative quasi-equilibrium models of transcriptional regulation.</article-title>
<source>
<italic>J Bioinform Comput Biol.</italic>
</source>
<year>2007</year>
;
<volume>5</volume>
(
<issue>2B</issue>
):
<fpage>467</fpage>
<lpage>490</lpage>
.
<pub-id pub-id-type="doi">10.1142/S0219720007002874</pub-id>
<pub-id pub-id-type="pmid">17636856</pub-id>
</mixed-citation>
</ref>
<ref id="ref-17">
<label>17</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Das</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Banerjee</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>MQ</given-names>
</name>
</person-group>
:
<article-title>Interacting models of cooperative gene regulation.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>2004</year>
;
<volume>101</volume>
(
<issue>46</issue>
):
<fpage>16234</fpage>
<lpage>16239</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.0407365101</pub-id>
<pmc-comment>528978</pmc-comment>
<pub-id pub-id-type="pmid">15534222</pub-id>
</mixed-citation>
</ref>
<ref id="ref-18">
<label>18</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lam</surname>
<given-names>FH</given-names>
</name>
<name>
<surname>Steger</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>O’Shea</surname>
<given-names>EK</given-names>
</name>
</person-group>
:
<article-title>Chromatin decouples promoter threshold from dynamic range.</article-title>
<source>
<italic>Nature.</italic>
</source>
<year>2008</year>
;
<volume>453</volume>
(
<issue>7192</issue>
):
<fpage>246</fpage>
<lpage>250</lpage>
.
<pub-id pub-id-type="doi">10.1038/nature06867</pub-id>
<pmc-comment>2435410</pmc-comment>
<pub-id pub-id-type="pmid">18418379</pub-id>
</mixed-citation>
</ref>
<ref id="ref-19">
<label>19</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mirny</surname>
<given-names>LA</given-names>
</name>
</person-group>
:
<article-title>Nucleosome-mediated cooperativity between transcription factors.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>2010</year>
;
<volume>107</volume>
(
<issue>52</issue>
):
<fpage>22534</fpage>
<lpage>22539</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.0913805107</pub-id>
<pmc-comment>3012490</pmc-comment>
<pub-id pub-id-type="pmid">21149679</pub-id>
</mixed-citation>
</ref>
<ref id="ref-20">
<label>20</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>XY</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sabo</surname>
<given-names>PJ</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>The role of chromatin accessibility in directing the widespread, overlapping patterns of
<italic>Drosophila</italic>
transcription factor binding.</article-title>
<source>
<italic>Genome Biol.</italic>
</source>
<year>2011</year>
;
<volume>12</volume>
(
<issue>4</issue>
):
<fpage>R34</fpage>
.
<pub-id pub-id-type="doi">10.1186/gb-2011-12-4-r34</pub-id>
<pmc-comment>3218860</pmc-comment>
<pub-id pub-id-type="pmid">21473766</pub-id>
</mixed-citation>
</ref>
<ref id="ref-21">
<label>21</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choi</surname>
<given-names>JK</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>YJ</given-names>
</name>
</person-group>
:
<article-title>Intrinsic variability of gene expression encoded in nucleosome positioning sequences.</article-title>
<source>
<italic>Nat Genet.</italic>
</source>
<year>2009</year>
;
<volume>41</volume>
(
<issue>4</issue>
):
<fpage>498</fpage>
<lpage>503</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng.319</pub-id>
<pub-id pub-id-type="pmid">19252489</pub-id>
</mixed-citation>
</ref>
<ref id="ref-22">
<label>22</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lidor Nili</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Field</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Lubling</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>p53 binds preferentially to genomic regions with high DNA-encoded nucleosome occupancy.</article-title>
<source>
<italic>Genome Res.</italic>
</source>
<year>2010</year>
;
<volume>20</volume>
(
<issue>10</issue>
):
<fpage>1361</fpage>
<lpage>1368</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.103945.109</pub-id>
<pmc-comment>2945185</pmc-comment>
<pub-id pub-id-type="pmid">20716666</pub-id>
</mixed-citation>
</ref>
<ref id="ref-23">
<label>23</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raveh-Sadka</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Levo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
</person-group>
:
<article-title>Incorporating nucleosomes into thermodynamic models of transcription regulation.</article-title>
<source>
<italic>Genome Res.</italic>
</source>
<year>2009</year>
;
<volume>19</volume>
(
<issue>8</issue>
):
<fpage>1480</fpage>
<lpage>1496</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.088260.108</pub-id>
<pmc-comment>2720181</pmc-comment>
<pub-id pub-id-type="pmid">19451592</pub-id>
</mixed-citation>
</ref>
<ref id="ref-24">
<label>24</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Widom</surname>
<given-names>J</given-names>
</name>
</person-group>
:
<article-title>Poly(dA:dT) tracts: major determinants of nucleosome organization.</article-title>
<source>
<italic>Curr Opin Struct Biol.</italic>
</source>
<year>2009</year>
;
<volume>19</volume>
(
<issue>1</issue>
):
<fpage>65</fpage>
<lpage>71</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.sbi.2009.01.004</pub-id>
<pmc-comment>2673466</pmc-comment>
<pub-id pub-id-type="pmid">19208466</pub-id>
</mixed-citation>
</ref>
<ref id="ref-25">
<label>25</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaplan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>IK</given-names>
</name>
<name>
<surname>Fondufe-Mittendorf</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>The DNA-encoded nucleosome organization of a eukaryotic genome.</article-title>
<source>
<italic>Nature.</italic>
</source>
<year>2009</year>
;
<volume>458</volume>
(
<issue>7236</issue>
):
<fpage>362</fpage>
<lpage>366</lpage>
.
<pub-id pub-id-type="doi">10.1038/nature07667</pub-id>
<pmc-comment>2658732</pmc-comment>
<pub-id pub-id-type="pmid">19092803</pub-id>
</mixed-citation>
</ref>
<ref id="ref-26">
<label>26</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>van der Heijden</surname>
<given-names>T</given-names>
</name>
<name>
<surname>van Vugt</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Logie</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>2012</year>
;
<volume>109</volume>
(
<issue>38</issue>
):
<fpage>E2514</fpage>
<lpage>22</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.1205659109</pub-id>
<pmc-comment>3458375</pmc-comment>
<pub-id pub-id-type="pmid">22908247</pub-id>
</mixed-citation>
</ref>
<ref id="ref-27">
<label>27</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Widom</surname>
<given-names>J</given-names>
</name>
</person-group>
:
<article-title>What controls nucleosome positions?</article-title>
<source>
<italic>Trends Genet.</italic>
</source>
<year>2009</year>
;
<volume>25</volume>
(
<issue>8</issue>
):
<fpage>335</fpage>
<lpage>343</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.tig.2009.06.002</pub-id>
<pmc-comment>2810357</pmc-comment>
<pub-id pub-id-type="pmid">19596482</pub-id>
</mixed-citation>
</ref>
<ref id="ref-28">
<label>28</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>CK</given-names>
</name>
<name>
<surname>Shibata</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Evidence for nucleosome depletion at active regulatory regions genome-wide.</article-title>
<source>
<italic>Nat Genet.</italic>
</source>
<year>2004</year>
;
<volume>36</volume>
(
<issue>8</issue>
):
<fpage>900</fpage>
<lpage>905</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng1400</pub-id>
<pub-id pub-id-type="pmid">15247917</pub-id>
</mixed-citation>
</ref>
<ref id="ref-29">
<label>29</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shivaswamy</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bhinge</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation.</article-title>
<source>
<italic>PLoS Biol.</italic>
</source>
<year>2008</year>
;
<volume>6</volume>
(
<issue>3</issue>
):
<fpage>e65</fpage>
.
<pub-id pub-id-type="doi">10.1371/journal.pbio.0060065</pub-id>
<pmc-comment>2267817</pmc-comment>
<pub-id pub-id-type="pmid">18351804</pub-id>
</mixed-citation>
</ref>
<ref id="ref-30">
<label>30</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sharon</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lotan-Pompan</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Compensation for differences in gene copy number among yeast ribosomal proteins is encoded within their promoters.</article-title>
<source>
<italic>Genome Res.</italic>
</source>
<year>2011</year>
;
<volume>21</volume>
(
<issue>12</issue>
):
<fpage>2114</fpage>
<lpage>2128</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.119669.110</pub-id>
<pmc-comment>3227101</pmc-comment>
<pub-id pub-id-type="pmid">22009988</pub-id>
</mixed-citation>
</ref>
<ref id="ref-31">
<label>31</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>YH</given-names>
</name>
<name>
<surname>Dudoit</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Luu</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.</article-title>
<source>
<italic>Nucleic Acids Res.</italic>
</source>
<year>2002</year>
;
<volume>30</volume>
(
<issue>4</issue>
):
<fpage>e15</fpage>
.
<pub-id pub-id-type="doi">10.1093/nar/30.4.e15</pub-id>
<pmc-comment>100354</pmc-comment>
<pub-id pub-id-type="pmid">11842121</pub-id>
</mixed-citation>
</ref>
<ref id="ref-32">
<label>32</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oshlack</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wakefield</surname>
<given-names>MJ</given-names>
</name>
</person-group>
:
<article-title>Transcript length bias in RNA-seq data confounds systems biology.</article-title>
<source>
<italic>Biol Direct.</italic>
</source>
<year>2009</year>
;
<volume>4</volume>
:
<fpage>14</fpage>
.
<pub-id pub-id-type="doi">10.1186/1745-6150-4-14</pub-id>
<pmc-comment>2678084</pmc-comment>
<pub-id pub-id-type="pmid">19371405</pub-id>
</mixed-citation>
</ref>
<ref id="ref-33">
<label>33</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kalir</surname>
<given-names>S</given-names>
</name>
<name>
<surname>McClure</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Pabbaraju</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Ordering genes in a flagella pathway by analysis of expression kinetics from living bacteria.</article-title>
<source>
<italic>Science.</italic>
</source>
<year>2001</year>
;
<volume>292</volume>
(
<issue>5524</issue>
):
<fpage>2080</fpage>
<lpage>2083</lpage>
.
<pub-id pub-id-type="doi">10.1126/science.1058758</pub-id>
<pub-id pub-id-type="pmid">11408658</pub-id>
</mixed-citation>
</ref>
<ref id="ref-34">
<label>34</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Siwo</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zeevi</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Inferring gene expression from ribosomal promoter sequences, a crowdsourcing approach.</article-title>
<source>
<italic>Genome Res.</italic>
</source>
<year>2013</year>
;
<volume>23</volume>
(
<issue>11</issue>
):
<fpage>1928</fpage>
<lpage>1937</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.157420.113</pub-id>
<pmc-comment>3814892</pmc-comment>
<pub-id pub-id-type="pmid">23950146</pub-id>
</mixed-citation>
</ref>
<ref id="ref-35">
<label>35</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brukner</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Sánchez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Suck</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Trinucleotide models for DNA bending propensity: comparison of models based on DNaseI digestion and nucleosome packaging data.</article-title>
<source>
<italic>J Biomol Struct Dyn.</italic>
</source>
<year>1995</year>
;
<volume>13</volume>
(
<issue>2</issue>
):
<fpage>309</fpage>
<lpage>317</lpage>
.
<pub-id pub-id-type="doi">10.1080/07391102.1995.10508842</pub-id>
<pub-id pub-id-type="pmid">8579790</pub-id>
</mixed-citation>
</ref>
<ref id="ref-36">
<label>36</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Olson</surname>
<given-names>WK</given-names>
</name>
<name>
<surname>Gorin</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>XJ</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>DNA sequence-dependent deformability deduced from protein-DNA crystal complexes.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>1998</year>
;
<volume>95</volume>
(
<issue>19</issue>
):
<fpage>11163</fpage>
<lpage>11168</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.95.19.11163</pub-id>
<pmc-comment>21613</pmc-comment>
<pub-id pub-id-type="pmid">9736707</pub-id>
</mixed-citation>
</ref>
<ref id="ref-37">
<label>37</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sivolob</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Khrapunov</surname>
<given-names>SN</given-names>
</name>
</person-group>
:
<article-title>Translational positioning of nucleosomes on DNA: the role of sequence-dependent isotropic DNA bending stiffness.</article-title>
<source>
<italic>J Mol Biol.</italic>
</source>
<year>1995</year>
;
<volume>247</volume>
(
<issue>5</issue>
):
<fpage>918</fpage>
<lpage>931</lpage>
.
<pub-id pub-id-type="doi">10.1006/jmbi.1994.0190</pub-id>
<pub-id pub-id-type="pmid">7723041</pub-id>
</mixed-citation>
</ref>
<ref id="ref-38">
<label>38</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raveh-Sadka</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Levo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shabi</surname>
<given-names>U</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Manipulating nucleosome disfavoring sequences allows fine-tune regulation of gene expression in yeast.</article-title>
<source>
<italic>Nat Genet.</italic>
</source>
<year>2012</year>
;
<volume>44</volume>
(
<issue>7</issue>
):
<fpage>743</fpage>
<lpage>750</lpage>
.
<pub-id pub-id-type="doi">10.1038/ng.2305</pub-id>
<pub-id pub-id-type="pmid">22634752</pub-id>
</mixed-citation>
</ref>
<ref id="ref-39">
<label>39</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lascaris</surname>
<given-names>RF</given-names>
</name>
<name>
<surname>Mager</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Planta</surname>
<given-names>RJ</given-names>
</name>
</person-group>
:
<article-title>DNA-binding requirements of the yeast protein Rap1p as selected
<italic>in silico</italic>
from ribosomal protein gene promoter sequences.</article-title>
<source>
<italic>Bioinformatics.</italic>
</source>
<year>1999</year>
;
<volume>15</volume>
(
<issue>4</issue>
):
<fpage>267</fpage>
<lpage>277</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/15.4.267</pub-id>
<pub-id pub-id-type="pmid">10320394</pub-id>
</mixed-citation>
</ref>
<ref id="ref-40">
<label>40</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Packer</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Dauncey</surname>
<given-names>MP</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>CA</given-names>
</name>
</person-group>
:
<article-title>Sequence-dependent DNA structure: tetranucleotide conformational maps.</article-title>
<source>
<italic>J Mol Biol.</italic>
</source>
<year>2000</year>
;
<volume>295</volume>
(
<issue>1</issue>
):
<fpage>85</fpage>
<lpage>103</lpage>
.
<pub-id pub-id-type="doi">10.1006/jmbi.1999.3237</pub-id>
<pub-id pub-id-type="pmid">10623510</pub-id>
</mixed-citation>
</ref>
<ref id="ref-41">
<label>41</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Laurens</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Rusling</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Pernstich</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>DNA looping by FokI: the impact of twisting and bending rigidity on protein-induced looping dynamics.</article-title>
<source>
<italic>Nucleic Acids Res.</italic>
</source>
<year>2012</year>
;
<volume>40</volume>
(
<issue>11</issue>
):
<fpage>4988</fpage>
<lpage>4997</lpage>
.
<pub-id pub-id-type="doi">10.1093/nar/gks184</pub-id>
<pmc-comment>3367208</pmc-comment>
<pub-id pub-id-type="pmid">22373924</pub-id>
</mixed-citation>
</ref>
<ref id="ref-42">
<label>42</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Starr</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Hoopes</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Hawley</surname>
<given-names>DK</given-names>
</name>
</person-group>
:
<article-title>DNA bending is an important component of site-specific recognition by the TATA binding protein.</article-title>
<source>
<italic>J Mol Biol.</italic>
</source>
<year>1995</year>
;
<volume>250</volume>
(
<issue>4</issue>
):
<fpage>434</fpage>
<lpage>446</lpage>
.
<pub-id pub-id-type="doi">10.1006/jmbi.1995.0388</pub-id>
<pub-id pub-id-type="pmid">7616566</pub-id>
</mixed-citation>
</ref>
<ref id="ref-43">
<label>43</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vijayan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Zuzow</surname>
<given-names>R</given-names>
</name>
<name>
<surname>O’Shea</surname>
<given-names>EK</given-names>
</name>
</person-group>
:
<article-title>Oscillations in supercoiling drive circadian gene expression in cyanobacteria.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>2009</year>
;
<volume>106</volume>
(
<issue>52</issue>
):
<fpage>22564</fpage>
<lpage>22568</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.0912673106</pub-id>
<pmc-comment>2799730</pmc-comment>
<pub-id pub-id-type="pmid">20018699</pub-id>
</mixed-citation>
</ref>
<ref id="ref-44">
<label>44</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parvin</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>McCormick</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Sharp</surname>
<given-names>PA</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor.</article-title>
<source>
<italic>Nature.</italic>
</source>
<year>1995</year>
;
<volume>373</volume>
(
<issue>6516</issue>
):
<fpage>724</fpage>
<lpage>727</lpage>
.
<pub-id pub-id-type="doi">10.1038/373724a0</pub-id>
<pub-id pub-id-type="pmid">7854460</pub-id>
</mixed-citation>
</ref>
<ref id="ref-45">
<label>45</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bosio</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Negri</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dieci</surname>
<given-names>G</given-names>
</name>
</person-group>
:
<article-title>Promoter architectures in the yeast ribosomal expression program.</article-title>
<source>
<italic>Transcription.</italic>
</source>
<year>2011</year>
;
<volume>2</volume>
(
<issue>2</issue>
):
<fpage>71</fpage>
<lpage>77</lpage>
.
<pub-id pub-id-type="doi">10.4161/trns.2.2.14486</pub-id>
<pmc-comment>3062397</pmc-comment>
<pub-id pub-id-type="pmid">21468232</pub-id>
</mixed-citation>
</ref>
<ref id="ref-46">
<label>46</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yonetani</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kono</surname>
<given-names>H</given-names>
</name>
</person-group>
:
<article-title>Sequence dependencies of DNA deformability and hydration in the minor groove.</article-title>
<source>
<italic>Biophys J.</italic>
</source>
<year>2009</year>
;
<volume>97</volume>
(
<issue>4</issue>
):
<fpage>1138</fpage>
<lpage>1147</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.bpj.2009.05.049</pub-id>
<pmc-comment>2726331</pmc-comment>
<pub-id pub-id-type="pmid">19686662</pub-id>
</mixed-citation>
</ref>
<ref id="ref-47">
<label>47</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Vilardell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Warner</surname>
<given-names>JR</given-names>
</name>
</person-group>
:
<article-title>An RNA structure involved in feedback regulation of splicing and of translation is critical for biological fitness.</article-title>
<source>
<italic>Proc Natl Acad Sci U S A.</italic>
</source>
<year>1996</year>
;
<volume>93</volume>
(
<issue>4</issue>
):
<fpage>1596</fpage>
<lpage>1600</lpage>
.
<pub-id pub-id-type="doi">10.1073/pnas.93.4.1596</pub-id>
<pmc-comment>39987</pmc-comment>
<pub-id pub-id-type="pmid">8643676</pub-id>
</mixed-citation>
</ref>
<ref id="ref-48">
<label>48</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deutschbauer</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Jaramillo</surname>
<given-names>DF</given-names>
</name>
<name>
<surname>Proctor</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast.</article-title>
<source>
<italic>Genetics.</italic>
</source>
<year>2005</year>
;
<volume>169</volume>
(
<issue>4</issue>
):
<fpage>1915</fpage>
<lpage>1925</lpage>
.
<pub-id pub-id-type="doi">10.1534/genetics.104.036871</pub-id>
<pmc-comment>1449596</pmc-comment>
<pub-id pub-id-type="pmid">15716499</pub-id>
</mixed-citation>
</ref>
<ref id="ref-49">
<label>49</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Warner</surname>
<given-names>JR</given-names>
</name>
</person-group>
:
<article-title>The economics of ribosome biosynthesis in yeast.</article-title>
<source>
<italic>Trends Biochem Sci.</italic>
</source>
<year>1999</year>
;
<volume>24</volume>
(
<issue>11</issue>
):
<fpage>437</fpage>
<lpage>440</lpage>
.
<pub-id pub-id-type="doi">10.1016/S0968-0004(99)01460-7</pub-id>
<pub-id pub-id-type="pmid">10542411</pub-id>
</mixed-citation>
</ref>
<ref id="ref-50">
<label>50</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Spahn</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Beckmann</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Eswar</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Structure of the 80S ribosome from
<italic>Saccharomyces cerevisiae</italic>
--tRNA-ribosome and subunit-subunit interactions.</article-title>
<source>
<italic>Cell.</italic>
</source>
<year>2001</year>
;
<volume>107</volume>
(
<issue>3</issue>
):
<fpage>373</fpage>
<lpage>386</lpage>
.
<pub-id pub-id-type="doi">10.1016/S0092-8674(01)00539-6</pub-id>
<pub-id pub-id-type="pmid">11701127</pub-id>
</mixed-citation>
</ref>
<ref id="ref-51">
<label>51</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ju</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Warner</surname>
<given-names>JR</given-names>
</name>
</person-group>
:
<article-title>Ribosome synthesis during the growth cycle of
<italic>Saccharomyces cerevisiae</italic>
.</article-title>
<source>
<italic>Yeast.</italic>
</source>
<year>1994</year>
;
<volume>10</volume>
(
<issue>2</issue>
):
<fpage>151</fpage>
<lpage>157</lpage>
.
<pub-id pub-id-type="doi">10.1002/yea.320100203</pub-id>
<pub-id pub-id-type="pmid">8203157</pub-id>
</mixed-citation>
</ref>
<ref id="ref-52">
<label>52</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Causton</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Koh</surname>
<given-names>SS</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Remodeling of yeast genome expression in response to environmental changes.</article-title>
<source>
<italic>Mol Biol Cell.</italic>
</source>
<year>2001</year>
;
<volume>12</volume>
(
<issue>2</issue>
):
<fpage>323</fpage>
<lpage>337</lpage>
.
<pub-id pub-id-type="doi">10.1091/mbc.12.2.323</pub-id>
<pmc-comment>30946</pmc-comment>
<pub-id pub-id-type="pmid">11179418</pub-id>
</mixed-citation>
</ref>
<ref id="ref-53">
<label>53</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oinn</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Addis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ferris</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Taverna: a tool for the composition and enactment of bioinformatics workflows.</article-title>
<source>
<italic>Bioinformatics.</italic>
</source>
<year>2004</year>
;
<volume>20</volume>
(
<issue>17</issue>
):
<fpage>3045</fpage>
<lpage>3054</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth361</pub-id>
<pub-id pub-id-type="pmid">15201187</pub-id>
</mixed-citation>
</ref>
<ref id="ref-54">
<label>54</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goñi</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Fenollosa</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pérez</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>DNAlive: a tool for the physical analysis of DNA at the genomic scale.</article-title>
<source>
<italic>Bioinformatics.</italic>
</source>
<year>2008</year>
;
<volume>24</volume>
(
<issue>15</issue>
):
<fpage>1731</fpage>
<lpage>1732</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn259</pub-id>
<pub-id pub-id-type="pmid">18544548</pub-id>
</mixed-citation>
</ref>
<ref id="ref-55">
<label>55</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Satchwell</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Drew</surname>
<given-names>HR</given-names>
</name>
<name>
<surname>Travers</surname>
<given-names>AA</given-names>
</name>
</person-group>
:
<article-title>Sequence periodicities in chicken nucleosome core DNA.</article-title>
<source>
<italic>J Mol Biol.</italic>
</source>
<year>1986</year>
;
<volume>191</volume>
(
<issue>4</issue>
):
<fpage>659</fpage>
<lpage>675</lpage>
.
<pub-id pub-id-type="doi">10.1016/0022-2836(86)90452-3</pub-id>
<pub-id pub-id-type="pmid">3806678</pub-id>
</mixed-citation>
</ref>
<ref id="ref-56">
<label>56</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hall</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>The WEKA data mining software: an update.</article-title>
<source>
<italic>SIGKDD Explor.</italic>
</source>
<year>2009</year>
;
<volume>11</volume>
(
<issue>1</issue>
):
<fpage>10</fpage>
<lpage>18</lpage>
.
<pub-id pub-id-type="doi">10.1145/1656274.1656278</pub-id>
</mixed-citation>
</ref>
<ref id="ref-57">
<label>57</label>
<mixed-citation publication-type="data">
<person-group person-group-type="author">
<name>
<surname>Siwo</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rider</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
:
<article-title>Dataset 1 in: Prediction of fine-tuned promoter activity from DNA sequence.</article-title>
<source>
<italic>F1000Research.</italic>
</source>
<year>2016</year>
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5256/f1000research.7485.d113516">Data Source</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
<sub-article id="report14225" article-type="peer-review">
<front-stub>
<article-id pub-id-type="doi">10.5256/f1000research.8064.r14225</article-id>
<title-group>
<article-title>Referee response for version 1</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Grau</surname>
<given-names>Jan</given-names>
</name>
<xref ref-type="aff" rid="r14225a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r14225a1">
<label>1</label>
Institute of Computer Science, Martin Luther University of Halle-Wittenberg, Halle, Germany</aff>
</contrib-group>
<author-notes>
<fn fn-type="COI-statement">
<p>
<bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>20</day>
<month>6</month>
<year>2016</year>
</pub-date>
<related-article id="d35e3974" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.7485.1">Version 1</related-article>
<custom-meta-group>
<custom-meta>
<meta-name>recommendation</meta-name>
<meta-value>approve-with-reservations</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body>
<p>The authors present FIrST, an approach for predicting promoter activity from sequence, which won one of the DREAM6 challenges. FIrST uses only simple sequence features within a limited range (100 bp) upstream of the translation start site to make its predictions, which distinguishes it from several other approaches in this field.</p>
<p>The prediction results are convincing and the method appears to be sound. However, the method is currently not described in sufficient detail. In addition, I have a few further major and minor concerns regarding the current version of the manuscript:</p>
<p> Major comments:
<list list-type="order">
<list-item>
<p>In the list of features described in section "Feature extraction", some seem redundant to me. For instance, the trinucleotide parameters for bendability are just computed from the k-mers for k=3. Also nucleosome binding prediction was based on trinucleotide preference. Please explain why it may be useful to also include those 3-mer-derived features in addition to the 3-mers themselves.</p>
</list-item>
<list-item>
<p>The description of methods in section "Machine learning model exploration" is too coarse. Please provide more detail on the SVMs, linear regression, and regression trees employed. It also remains unclear whether the feature scales are normalized in some way before the values are passed to the SVM.</p>
</list-item>
<list-item>
<p>No details are given on the selected 3-mers and 4-mers (Table 1). Please provide a list of the specific k-mers selected by FIrST. It may also be reasonable to discuss potential biological reasons for their importance (as partly covered for TATA-boxes on page 6).</p>
</list-item>
<list-item>
<p>Considering Fig. 3, I wondered whether the difference in deformability may be related to transcription initiation. Stated differently, might we observe an even clearer signal if all sequences (and their deformability profiles) were aligned by the transcription start site (TSS) instead of the TrSS? One idea in the same direction, which could contribute to the novelty of the manuscript, would be to evaluate similar profiles (of sequences aligned to the TSS or TrSS) for all features found to be informative by FIrST. For instance, one could expect to see general fluctuations of G/C content, or the TATA-box in 4-mer profiles as a spike approx. 35 bp before the TSS. From my perspective, this might improve the novelty of the manuscript and the interpretation of the features.</p>
</list-item>
</list>
</p>
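<p>The redundancy raised in major comment 1 can be illustrated with a minimal sketch. It assumes that the trinucleotide-derived feature is a sequence-averaged score. The scale values and helper names below are hypothetical placeholders, not the published bendability parameters or the authors' code. Under that assumption, the score is exactly a weighted sum of 3-mer frequencies:</p>
<preformat>
```python
import math
from collections import Counter

# Hypothetical trinucleotide scale, for illustration only. These numbers are
# NOT published bendability parameters; unlisted 3-mers default to 0.0.
TOY_SCALE = {"AAA": -0.27, "TAT": 0.18, "ATA": 0.18, "GGC": 0.02}


def trimer_frequencies(seq):
    """Return the relative frequency of every overlapping 3-mer in seq."""
    trimers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    counts = Counter(trimers)
    total = len(trimers)
    return {kmer: n / total for kmer, n in counts.items()}


def mean_scale_score(seq, scale):
    """Average the per-trinucleotide scale value along the sequence."""
    trimers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return sum(scale.get(t, 0.0) for t in trimers) / len(trimers)


def score_from_frequencies(freqs, scale):
    """Recompute the same average as a weighted sum of 3-mer frequencies."""
    return sum(freq * scale.get(kmer, 0.0) for kmer, freq in freqs.items())


seq = "TATAAATAGGCAAATATA"  # toy sequence, not a real promoter
direct = mean_scale_score(seq, TOY_SCALE)
via_kmers = score_from_frequencies(trimer_frequencies(seq), TOY_SCALE)

# Both routes give the same number, up to floating-point rounding.
print(direct, via_kmers, math.isclose(direct, via_kmers))
```
</preformat>
<p>In a linear model that already contains all 64 trinucleotide frequencies, such an averaged score adds no new information. It can only contribute when a subset of the 3-mers is used or when the learner is nonlinear.</p>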
<p> Minor comments:
<list list-type="order">
<list-item>
<p>The data from the DREAM6 challenge cover only a special subset of genes (ribosomal genes) and only in yeast. It is unclear whether the features derived by the authors' method would also be informative for higher eukaryotes. I understand that this question cannot be settled from the DREAM6 data, but the authors might comment on this issue.</p>
</list-item>
<list-item>
<p>Figure 1B remains a bit unclear. In the caption and the main text, the authors explain that they use non-overlapping 100bp sub-sequences. However, from the figure it rather seems that they consider upstream sequences of 300 bp, 200 bp and 100 bp (and the full promoter sequence) relative to the translation start site. Please clarify.</p>
</list-item>
<list-item>
<p>In section "DREAM6 challenge data" of "Methods", the authors refer to "the sequence 1200 bp upstream of a gene", where "upstream of the translation start site" (as in the remainder of the text) would be more specific.</p>
</list-item>
<list-item>
<p>In section "Feature extraction", the authors explain that "each promoter sequence was divided into 100 bp non-overlapping windows", while in the previous section they explain that the full 1200 bp sequences do not extend over the nearest gene. As I understand it, this may result in some of the sequences being shorter than 1200 bp, and their length might not be divisible by 100. Please explain how such cases are handled.</p>
</list-item>
<list-item>
<p>At the end of section "Validation of model by DREAM6 consortium", the authors explain that "the overall score was defined as the product of the four P-values", whereas later they explain that -log
<sub>10</sub>
of the geometric mean of the p-values was used as the overall measure. Although both definitions are equivalent with respect to the resulting ranking, I would suggest providing one consistent definition of the overall score.</p>
</list-item>
<list-item>
<p>It is not fully clear from the manuscript whether the TA-tracts (also termed poly(dA-dT) tracts in some parts of the manuscript) are tracts of poly "A or T" or tracts of poly "AT"-dinucleotides.</p>
</list-item>
<list-item>
<p>In section "Error profile of SVM promoter activity model", the authors explain that natural promoters had (slightly) lower activity than synthetic promoters and that the prediction error of the SVM is lower for natural promoters. However, I do not see why this would explain the larger prediction errors for low-activity genes.</p>
</list-item>
<list-item>
<p>In section "Error profile of SVM promoter activity model", the authors explain that one reason why FIrST did not perform well for synthetic promoters is that most mutations had been introduced outside the 100 bp range considered by FIrST. However, this reasoning partly contradicts the authors' claim that most of the transcriptional activity may be explained by the sequence in that 100 bp window. If this were truly the case, mutations outside this range should have only minor effects.</p>
</list-item>
<list-item>
<p>In the Discussion, the authors mention that TF binding motifs are 6 to 8 bp in length. While this may be true for several yeast TFs, it is not correct for eukaryotes in general and motifs may be wider than 10 bp.</p>
</list-item>
</list>
</p>
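<p>The equivalence noted in minor comment 5 can be written out explicitly. The symbols below are notation introduced here for illustration, not the manuscript's: p_1 to p_4 are the four P-values of a submission (assumed strictly positive), S_prod is the product score, S_log is the score based on the geometric mean, and (a) and (b) label two submissions.</p>
<preformat>
```latex
\[
S_{\mathrm{prod}} = \prod_{i=1}^{4} p_i ,
\qquad
S_{\log} = -\log_{10}\!\left[ \left( \prod_{i=1}^{4} p_i \right)^{1/4} \right]
         = -\tfrac{1}{4} \log_{10} S_{\mathrm{prod}}
\]
\[
S_{\log}^{(a)} > S_{\log}^{(b)}
\iff
S_{\mathrm{prod}}^{(b)} > S_{\mathrm{prod}}^{(a)}
\]
```
</preformat>
<p>Because the map from S_prod to S_log is strictly decreasing, the submission with the smallest product of P-values is also the one with the largest S_log. Both definitions therefore yield the same ranking, even though the reported numbers differ.</p>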
<p>I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
</body>
</sub-article>
<sub-article id="report12380" article-type="peer-review">
<front-stub>
<article-id pub-id-type="doi">10.5256/f1000research.8064.r12380</article-id>
<title-group>
<article-title>Referee response for version 1</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Ruan</surname>
<given-names>Jianhua</given-names>
</name>
<xref ref-type="aff" rid="r12380a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r12380a1">
<label>1</label>
Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX, USA</aff>
</contrib-group>
<author-notes>
<fn fn-type="COI-statement">
<p>
<bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>7</day>
<month>4</month>
<year>2016</year>
</pub-date>
<related-article id="d35e4079" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.7485.1">Version 1</related-article>
<custom-meta-group>
<custom-meta>
<meta-name>recommendation</meta-name>
<meta-value>approve-with-reservations</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body>
<p>This article describes the winning method of the DREAM6 promoter activity prediction challenge. While a meta-analysis of the methods that competed in the challenge has already been published (Meyer
<italic>et al.</italic>
, 2013), this article provides more details of the winning method and some additional analysis of the predictive model, which may lead to a better understanding of the predictability of gene transcription. While its contribution is not in doubt, the article should be revised to address several issues:</p>
<p>Major issues:
<list list-type="order">
<list-item>
<p>The 23 features utilized by the SVM model (as well as their coefficients in the model) are not provided explicitly in either the main text or the supplementary file. Table 1 in the main text shows that 6 trinucleotides and 12 tetranucleotides are important features, but it is not stated anywhere which tri- and tetranucleotides they are. For the lengths of T or TA-tracts, the supplementary file shows several different values, including mean, median and stdev. It is unclear which one is actually used by the SVM model. Similarly, the supplementary file shows 79 values for deformability and it is not stated which one is used.</p>
</list-item>
<list-item>
<p>In the case of ranking the features by their SVM coefficients, the authors need to clarify whether the feature values were normalized prior to model building, as these features are on very different scales and, if they are not normalized, the ranking of the coefficients is not very meaningful.</p>
</list-item>
<list-item>
<p>The main conclusion in the subsection "Error profile of SVM promoter activity model" does not seem to make sense. First, promoters of low activity had larger prediction errors. Then the authors state that natural promoters had lower activity. This seems to contradict their observation that the prediction error was significantly lower for natural promoters than for mutated promoters.</p>
</list-item>
</list>
</p>
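<p>The scale dependence raised in major issue 2 can be shown with a minimal sketch. It uses an ordinary least-squares fit as a stand-in for the linear SVM, because the authors' model settings are not given here. The data and function names are invented for illustration. The same dependence on feature units applies to the weights of any linear model:</p>
<preformat>
```python
from statistics import mean, pstdev


def fit_two_feature_ols(x1, x2, y):
    """Least-squares slopes for y = b1*x1 + b2*x2 + intercept (normal equations)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Toy data, invented for illustration: a frequency in [0, 1] and a length in bp.
g_freq = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05]
tract_len = [12, 3, 20, 8, 15, 5]
activity = [2.0 * g + 0.05 * t for g, t in zip(g_freq, tract_len)]

raw_b1, raw_b2 = fit_two_feature_ols(g_freq, tract_len, activity)
std_b1, std_b2 = fit_two_feature_ols(zscore(g_freq), zscore(tract_len), activity)

# On raw units the frequency feature has the larger coefficient; after
# standardization the tract-length feature does, so the ranking flips.
print("raw coefficients:", round(raw_b1, 3), round(raw_b2, 3))
print("standardized coefficients:", round(std_b1, 3), round(std_b2, 3))
```
</preformat>
<p>In this toy case the ranking by absolute coefficient reverses once both features are put on a common scale, which is why the normalization step needs to be stated before a coefficient ranking can be interpreted.</p>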
<p>Minor issues:
<list list-type="order">
<list-item>
<p>The authors mention only that feature selection was done in WEKA with a wrapper. More details need to be given. For example, what selection strategy did the wrapper use, e.g., exhaustive search, greedy forward search, backward search, or another type of heuristic?</p>
</list-item>
<list-item>
<p>What is the purpose of first training 1000 SVM classifiers using 66% of data as training and 34% as testing, and then another 500 SVM classifiers using 80% as training and 20% as testing?</p>
</list-item>
</list>
</p>
<p>I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="rep-ref-12380-1">
<label>1</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Siwo</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zeevi</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sharon</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Norel</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Stolovitzky</surname>
<given-names>G</given-names>
</name>
</person-group>
:
<article-title>Inferring gene expression from ribosomal promoter sequences, a crowdsourcing approach.</article-title>
<source>
<italic>Genome Res</italic>
</source>
.
<year>2013</year>
;
<volume>23</volume>
(
<issue>11</issue>
) :
<elocation-id>10.1101/gr.157420.113</elocation-id>
<fpage>1928</fpage>
-
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1101/gr.157420.113</pub-id>
<pub-id pub-id-type="pmid">23950146</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</sub-article>
<sub-article id="report12530" article-type="peer-review">
<front-stub>
<article-id pub-id-type="doi">10.5256/f1000research.8064.r12530</article-id>
<title-group>
<article-title>Referee response for version 1</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Pavlidis</surname>
<given-names>Paul</given-names>
</name>
<xref ref-type="aff" rid="r12530a1">1</xref>
<role>Referee</role>
</contrib>
<aff id="r12530a1">
<label>1</label>
Centre for High-Throughput Biology and Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada</aff>
</contrib-group>
<author-notes>
<fn fn-type="COI-statement">
<p>
<bold>Competing interests: </bold>
No competing interests were disclosed.</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>2</day>
<month>3</month>
<year>2016</year>
</pub-date>
<related-article id="d35e4232" related-article-type="peer-reviewed-article" ext-link-type="doi" xlink:href="10.12688/f1000research.7485.1">Version 1</related-article>
<custom-meta-group>
<custom-meta>
<meta-name>recommendation</meta-name>
<meta-value>approve</meta-value>
</custom-meta>
</custom-meta-group>
</front-stub>
<body>
<p>Siwo
<italic>et al</italic>
. give a detailed report of their entry to the DREAM promoter activity prediction assessment, conducted in 2011. The paper describing the results of the assessment appeared in 2013 (Meyer
<italic>et al</italic>
.), and the entry from Siwo
<italic>et al</italic>
. (“FiRST”) was the top-performer overall. Meyer
<italic>et al</italic>
. give few details about the specific methods, mentioning only that the FiRST entry used an SVM and did not use TF binding site motif information. Here it is clarified that FiRST is a simple method that uses only part of the sequence and that its most prominent features concern nucleotide content.</p>
<p>Because it is perhaps a little eye-opening (even embarrassing, depending on one’s point of view) that the best method in the assessment is so simple, this paper is an important footnote to Meyer
<italic>et al</italic>
. but it could be fleshed out further to get at what is going on. My suggestions for revisions are to give more detail about the properties of the sequences used and the relationship to performance.</p>
<p>FiRST predicts from only the 100 bases of sequence upstream of the translation start (which was considered as part of the promoter by DREAM; I note this is not “upstream of the gene” as described by Siwo
<italic>et al</italic>
. in the methods section), and its predictions were dominated by the effect of a simple measure of G content. Siwo
<italic>et al</italic>
. report that they did worse at predicting the synthetically mutated promoters (this was apparently not true overall across methods as reported by Meyer
<italic>et al</italic>
.). In Meyer
<italic>et al</italic>
., adding TF binding information to FiRST improved performance.</p>
<p>The authors mention this, but the most important reason that FiRST does poorly at predicting the synthetic mutations seems to be that most of the mutations (apparently 29 out of 33, based on Table 1 of Meyer
<italic>et al</italic>
.) are not in the 100 bp window used. That is, because in most cases these synthetic sequences were (as I understand it) identical in features to other examples while having different activities, for the purposes of FiRST, they could only introduce prediction errors. In light of this fact the rest of the speculation about why performance varied in this way seems extraneous.</p>
<p>It would also be useful to see more detailed information on the sequences used (e.g., the G content or other features), and the prediction error in each case. How well does one predict using G content alone? This might all be reconstructed from the data supplement helpfully provided, but the authors should consider providing the analysis. It also seems reasonable to ask for more details about the performance of other sequence windows.</p>
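<p>The question of how well G content alone predicts activity could be checked along the following lines. This is a minimal sketch under stated assumptions: each sequence is assumed to end immediately before the translation start site, so the last 100 bases form the window; the toy sequences and activity values are hypothetical and are not drawn from the data supplement. Spearman correlation is computed as the Pearson correlation of average ranks.</p>

```python
def g_fraction(seq, window=100):
    """Fraction of G in the last `window` bases of `seq`.

    Assumes `seq` ends immediately before the translation start site.
    """
    region = seq.upper()[-window:]
    return region.count("G") / len(region)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy inputs, for illustration only.
sequences = ["GGGGAT", "GGATAT", "GATATA", "ATATAT"]
activity = [0.2, 0.5, 0.9, 1.4]

rho = spearman([g_fraction(s) for s in sequences], activity)
print("Spearman correlation of G fraction with activity:", rho)
```

<p>Running the same calculation on the real promoters, and repeating it for other window sizes, would give a single-feature baseline against which the full model can be compared.</p>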
<p>The other main missing piece from this paper is any discussion or evidence that the method works beyond the narrow confines of the DREAM setup. Even for the RP genes, does it make a useful prediction, namely that increasing the G content of RP promoters in that 100 bp window will decrease promoter activity? I am fine with leaving this as “future work” but it would be worth mentioning.</p>
<p>Figure 2B is apparently the same as part of Figure 1E from Meyer
<italic>et al</italic>
., except FiRST is not marked (actually there is a small difference in the values plotted; the combined score for FiRST looks closer to 2 than the 1.87 reported and plotted in Meyer
<italic>et al</italic>
.). The authors should clearly cite Meyer
<italic>et al</italic>
. in the figure caption as the source of the data for this figure, or simply point the readers to Meyer
<italic>et al</italic>
., or else explain where the data came from if not from Meyer
<italic>et al</italic>
.</p>
<p>I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
</body>
</sub-article>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000897 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000897 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4916984
   |texte=   Prediction of fine-tuned promoter activity from DNA sequence
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27347373" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021