Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels

Internal identifier: 000964 (Pmc/Corpus); previous: 000963; next: 000965

Authors: Xiaolei Wang; Hiroyuki Kuwahara; Xin Gao

RBID : PMC:4305984

Abstract

Background

A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology.

Results

Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer that had not been reported before. Further tests on four other TFs in S. cerevisiae demonstrated the generality of our method.

Conclusion

We proposed in this paper a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.
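
To make the weighted degree (WD) kernel named in the Results more concrete, here is a minimal sketch of the plain WD kernel as it is commonly defined in the string-kernel literature: for each degree d up to a maximum D, it counts the positions where two equal-length sequences share the same d-mer and weights the count by 2(D - d + 1) / (D(D + 1)). This is an illustration only and not the authors' implementation. It omits the shifts and mismatches used in the first round, and the function name `wd_kernel` and the parameter `max_degree` are hypothetical.

```python
def wd_kernel(x: str, y: str, max_degree: int = 3) -> float:
    """Plain weighted degree kernel between two equal-length sequences.

    Illustrative sketch: no shifts, no mismatches (both are assumptions
    that simplify the kernel described in the abstract).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if max_degree < 1:
        raise ValueError("max_degree must be at least 1")

    length = len(x)
    total = 0.0
    for d in range(1, max_degree + 1):
        # Common WD weighting: shorter k-mers receive larger weights.
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        # Count positions where the d-mers of x and y agree exactly.
        matches = sum(
            1 for i in range(length - d + 1) if x[i:i + d] == y[i:i + d]
        )
        total += beta * matches
    return total
```

A kernel of this kind can be evaluated on every pair of training sequences to build a Gram matrix, which a support vector regression solver that accepts precomputed kernels can then use to fit measured affinities.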


DOI: 10.1186/1752-0509-8-S5-S5
PubMed: 25605483
PubMed Central: 4305984

Links to Exploration step

PMC:4305984

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels</title>
<author>
<name sortKey="Wang, Xiaolei" sort="Wang, Xiaolei" uniqKey="Wang X" first="Xiaolei" last="Wang">Xiaolei Wang</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kuwahara, Hiroyuki" sort="Kuwahara, Hiroyuki" uniqKey="Kuwahara H" first="Hiroyuki" last="Kuwahara">Hiroyuki Kuwahara</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gao, Xin" sort="Gao, Xin" uniqKey="Gao X" first="Xin" last="Gao">Xin Gao</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25605483</idno>
<idno type="pmc">4305984</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4305984</idno>
<idno type="RBID">PMC:4305984</idno>
<idno type="doi">10.1186/1752-0509-8-S5-S5</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000964</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000964</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels</title>
<author>
<name sortKey="Wang, Xiaolei" sort="Wang, Xiaolei" uniqKey="Wang X" first="Xiaolei" last="Wang">Xiaolei Wang</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kuwahara, Hiroyuki" sort="Kuwahara, Hiroyuki" uniqKey="Kuwahara H" first="Hiroyuki" last="Kuwahara">Hiroyuki Kuwahara</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gao, Xin" sort="Gao, Xin" uniqKey="Gao X" first="Xin" last="Gao">Xin Gao</name>
<affiliation>
<nlm:aff id="I1">Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Systems Biology</title>
<idno type="eISSN">1752-0509</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology.</p>
</sec>
<sec>
<title>Results</title>
<p>Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in
<italic>Saccharomyces cerevisiae </italic>
using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer that had not been reported before. Further tests on four other TFs in
<italic>S. cerevisiae </italic>
demonstrated the generality of our method.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We proposed in this paper a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Endy, D" uniqKey="Endy D">D Endy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Purnick, Pem" uniqKey="Purnick P">PEM Purnick</name>
</author>
<author>
<name sortKey="Weiss, R" uniqKey="Weiss R">R Weiss</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuwahara, H" uniqKey="Kuwahara H">H Kuwahara</name>
</author>
<author>
<name sortKey="Fan, M" uniqKey="Fan M">M Fan</name>
</author>
<author>
<name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
<author>
<name sortKey="Gao, X" uniqKey="Gao X">X Gao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alberts, B" uniqKey="Alberts B">B Alberts</name>
</author>
<author>
<name sortKey="Johnson, A" uniqKey="Johnson A">A Johnson</name>
</author>
<author>
<name sortKey="Lewis, J" uniqKey="Lewis J">J Lewis</name>
</author>
<author>
<name sortKey="Raff, M" uniqKey="Raff M">M Raff</name>
</author>
<author>
<name sortKey="Roberts, K" uniqKey="Roberts K">K Roberts</name>
</author>
<author>
<name sortKey="Walter, P" uniqKey="Walter P">P Walter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berger, Mf" uniqKey="Berger M">MF Berger</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gottardo, R" uniqKey="Gottardo R">R Gottardo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Maerkl, Sj" uniqKey="Maerkl S">SJ Maerkl</name>
</author>
<author>
<name sortKey="Quake, Sr" uniqKey="Quake S">SR Quake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fordyce, Pm" uniqKey="Fordyce P">PM Fordyce</name>
</author>
<author>
<name sortKey="Gerber, D" uniqKey="Gerber D">D Gerber</name>
</author>
<author>
<name sortKey="Tran, D" uniqKey="Tran D">D Tran</name>
</author>
<author>
<name sortKey="Zheng, J" uniqKey="Zheng J">J Zheng</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Derisi, Jl" uniqKey="Derisi J">JL DeRisi</name>
</author>
<author>
<name sortKey="Quake, Sr" uniqKey="Quake S">SR Quake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nutiu, R" uniqKey="Nutiu R">R Nutiu</name>
</author>
<author>
<name sortKey="Friedman, Rc" uniqKey="Friedman R">RC Friedman</name>
</author>
<author>
<name sortKey="Luo, S" uniqKey="Luo S">S Luo</name>
</author>
<author>
<name sortKey="Khrebtukova, I" uniqKey="Khrebtukova I">I Khrebtukova</name>
</author>
<author>
<name sortKey="Silva, D" uniqKey="Silva D">D Silva</name>
</author>
<author>
<name sortKey="Li, R" uniqKey="Li R">R Li</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Schroth, Gp" uniqKey="Schroth G">GP Schroth</name>
</author>
<author>
<name sortKey="Burge, Cb" uniqKey="Burge C">CB Burge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alleyne, Tm" uniqKey="Alleyne T">TM Alleyne</name>
</author>
<author>
<name sortKey="Pe A Castillo, L" uniqKey="Pe A Castillo L">L Peña-Castillo</name>
</author>
<author>
<name sortKey="Badis, G" uniqKey="Badis G">G Badis</name>
</author>
<author>
<name sortKey="Talukder, S" uniqKey="Talukder S">S Talukder</name>
</author>
<author>
<name sortKey="Berger, Mf" uniqKey="Berger M">MF Berger</name>
</author>
<author>
<name sortKey="Gehrke, Ar" uniqKey="Gehrke A">AR Gehrke</name>
</author>
<author>
<name sortKey="Philippakis, Aa" uniqKey="Philippakis A">AA Philippakis</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author>
<name sortKey="Morris, Qd" uniqKey="Morris Q">QD Morris</name>
</author>
<author>
<name sortKey="Hughes, Tr" uniqKey="Hughes T">TR Hughes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weirauch, Mt" uniqKey="Weirauch M">MT Weirauch</name>
</author>
<author>
<name sortKey="Cote, A" uniqKey="Cote A">A Cote</name>
</author>
<author>
<name sortKey="Norel, R" uniqKey="Norel R">R Norel</name>
</author>
<author>
<name sortKey="Annala, M" uniqKey="Annala M">M Annala</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Riley, Tr" uniqKey="Riley T">TR Riley</name>
</author>
<author>
<name sortKey="Saez Rodriguez, J" uniqKey="Saez Rodriguez J">J Saez-Rodriguez</name>
</author>
<author>
<name sortKey="Cokelaer, T" uniqKey="Cokelaer T">T Cokelaer</name>
</author>
<author>
<name sortKey="Vedenko, A" uniqKey="Vedenko A">A Vedenko</name>
</author>
<author>
<name sortKey="Talukder, S" uniqKey="Talukder S">S Talukder</name>
</author>
<author>
<name sortKey="Dreamc" uniqKey="Dreamc">DREAMC</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
<author>
<name sortKey="Morris, Qd" uniqKey="Morris Q">QD Morris</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author>
<name sortKey="Stolovitzky, G" uniqKey="Stolovitzky G">G Stolovitzky</name>
</author>
<author>
<name sortKey="Hughes, Tr" uniqKey="Hughes T">TR Hughes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berg, Og" uniqKey="Berg O">OG Berg</name>
</author>
<author>
<name sortKey="Von Hippel, Ph" uniqKey="Von Hippel P">PH von Hippel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benos, Pv" uniqKey="Benos P">PV Benos</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author>
<name sortKey="Johnson, Plf" uniqKey="Johnson P">PLF Johnson</name>
</author>
<author>
<name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Xs" uniqKey="Liu X">XS Liu</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
<author>
<name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foat, Bc" uniqKey="Foat B">BC Foat</name>
</author>
<author>
<name sortKey="Morozov, Av" uniqKey="Morozov A">AV Morozov</name>
</author>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
<author>
<name sortKey="Hughes, Tr" uniqKey="Hughes T">TR Hughes</name>
</author>
<author>
<name sortKey="Morris, Q" uniqKey="Morris Q">Q Morris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Agius, P" uniqKey="Agius P">P Agius</name>
</author>
<author>
<name sortKey="Arvey, A" uniqKey="Arvey A">A Arvey</name>
</author>
<author>
<name sortKey="Chang, W" uniqKey="Chang W">W Chang</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
<author>
<name sortKey="Leslie, C" uniqKey="Leslie C">C Leslie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Karchin, R" uniqKey="Karchin R">R Karchin</name>
</author>
<author>
<name sortKey="Beer, Ma" uniqKey="Beer M">MA Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Annala, M" uniqKey="Annala M">M Annala</name>
</author>
<author>
<name sortKey="Laurila, K" uniqKey="Laurila K">K Laurila</name>
</author>
<author>
<name sortKey="L Hdesm Ki, H" uniqKey="L Hdesm Ki H">H Lähdesmäki</name>
</author>
<author>
<name sortKey="Nykter, M" uniqKey="Nykter M">M Nykter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vapnik, V" uniqKey="Vapnik V">V Vapnik</name>
</author>
<author>
<name sortKey="Chervonenkis, A" uniqKey="Chervonenkis A">A Chervonenkis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jebara, T" uniqKey="Jebara T">T Jebara</name>
</author>
<author>
<name sortKey="Kondor, R" uniqKey="Kondor R">R Kondor</name>
</author>
<author>
<name sortKey="Howard, A" uniqKey="Howard A">A Howard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xie, B" uniqKey="Xie B">B Xie</name>
</author>
<author>
<name sortKey="Jankovic, Br" uniqKey="Jankovic B">BR Jankovic</name>
</author>
<author>
<name sortKey="Bajic, Vb" uniqKey="Bajic V">VB Bajic</name>
</author>
<author>
<name sortKey="Song, L" uniqKey="Song L">L Song</name>
</author>
<author>
<name sortKey="Gao, X" uniqKey="Gao X">X Gao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leslie, C" uniqKey="Leslie C">C Leslie</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leslie, Cs" uniqKey="Leslie C">CS Leslie</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Cohen, A" uniqKey="Cohen A">A Cohen</name>
</author>
<author>
<name sortKey="Weston, J" uniqKey="Weston J">J Weston</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
<author>
<name sortKey="Sch Lkopf, B" uniqKey="Sch Lkopf B">B Schölkopf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohapatra, A" uniqKey="Mohapatra A">A Mohapatra</name>
</author>
<author>
<name sortKey="Mishra, Pm" uniqKey="Mishra P">PM Mishra</name>
</author>
<author>
<name sortKey="Padhy, S" uniqKey="Padhy S">S Padhy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
<author>
<name sortKey="Zien, A" uniqKey="Zien A">A Zien</name>
</author>
<author>
<name sortKey="Philips, P" uniqKey="Philips P">P Philips</name>
</author>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Natarajan, K" uniqKey="Natarajan K">K Natarajan</name>
</author>
<author>
<name sortKey="Meyer, Mr" uniqKey="Meyer M">MR Meyer</name>
</author>
<author>
<name sortKey="Jackson, Bm" uniqKey="Jackson B">BM Jackson</name>
</author>
<author>
<name sortKey="Slade, D" uniqKey="Slade D">D Slade</name>
</author>
<author>
<name sortKey="Roberts, C" uniqKey="Roberts C">C Roberts</name>
</author>
<author>
<name sortKey="Hinnebusch, Ag" uniqKey="Hinnebusch A">AG Hinnebusch</name>
</author>
<author>
<name sortKey="Marton, Mj" uniqKey="Marton M">MJ Marton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hope, Ia" uniqKey="Hope I">IA Hope</name>
</author>
<author>
<name sortKey="Struhl, K" uniqKey="Struhl K">K Struhl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hill, De" uniqKey="Hill D">DE Hill</name>
</author>
<author>
<name sortKey="Hope, Ia" uniqKey="Hope I">IA Hope</name>
</author>
<author>
<name sortKey="Macke, Jp" uniqKey="Macke J">JP Macke</name>
</author>
<author>
<name sortKey="Struhl, K" uniqKey="Struhl K">K Struhl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sellers, Jw" uniqKey="Sellers J">JW Sellers</name>
</author>
<author>
<name sortKey="Vincent, Ac" uniqKey="Vincent A">AC Vincent</name>
</author>
<author>
<name sortKey="Struhl, K" uniqKey="Struhl K">K Struhl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hinnebusch, Ag" uniqKey="Hinnebusch A">AG Hinnebusch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhu, C" uniqKey="Zhu C">C Zhu</name>
</author>
<author>
<name sortKey="Byers, Kjrp" uniqKey="Byers K">KJRP Byers</name>
</author>
<author>
<name sortKey="Mccord, Rp" uniqKey="Mccord R">RP McCord</name>
</author>
<author>
<name sortKey="Shi, Z" uniqKey="Shi Z">Z Shi</name>
</author>
<author>
<name sortKey="Berger, Mf" uniqKey="Berger M">MF Berger</name>
</author>
<author>
<name sortKey="Newburger, De" uniqKey="Newburger D">DE Newburger</name>
</author>
<author>
<name sortKey="Saulrieta, K" uniqKey="Saulrieta K">K Saulrieta</name>
</author>
<author>
<name sortKey="Smith, Z" uniqKey="Smith Z">Z Smith</name>
</author>
<author>
<name sortKey="Shah, Mv" uniqKey="Shah M">MV Shah</name>
</author>
<author>
<name sortKey="Radhakrishnan, M" uniqKey="Radhakrishnan M">M Radhakrishnan</name>
</author>
<author>
<name sortKey="Philippakis, Aa" uniqKey="Philippakis A">AA Philippakis</name>
</author>
<author>
<name sortKey="Hu, Y" uniqKey="Hu Y">Y Hu</name>
</author>
<author>
<name sortKey="De Masi, F" uniqKey="De Masi F">F De Masi</name>
</author>
<author>
<name sortKey="Pacek, M" uniqKey="Pacek M">M Pacek</name>
</author>
<author>
<name sortKey="Rolfs, A" uniqKey="Rolfs A">A Rolfs</name>
</author>
<author>
<name sortKey="Murthy, T" uniqKey="Murthy T">T Murthy</name>
</author>
<author>
<name sortKey="Labaer, J" uniqKey="Labaer J">J Labaer</name>
</author>
<author>
<name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
<author>
<name sortKey="Zien, A" uniqKey="Zien A">A Zien</name>
</author>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
<author>
<name sortKey="Schweikert, G" uniqKey="Schweikert G">G Schweikert</name>
</author>
<author>
<name sortKey="Philips, P" uniqKey="Philips P">P Philips</name>
</author>
<author>
<name sortKey="Behr, J" uniqKey="Behr J">J Behr</name>
</author>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schweikert, G" uniqKey="Schweikert G">G Schweikert</name>
</author>
<author>
<name sortKey="Zien, A" uniqKey="Zien A">A Zien</name>
</author>
<author>
<name sortKey="Zeller, G" uniqKey="Zeller G">G Zeller</name>
</author>
<author>
<name sortKey="Behr, J" uniqKey="Behr J">J Behr</name>
</author>
<author>
<name sortKey="Dieterich, C" uniqKey="Dieterich C">C Dieterich</name>
</author>
<author>
<name sortKey="Ong, Cs" uniqKey="Ong C">CS Ong</name>
</author>
<author>
<name sortKey="Philips, P" uniqKey="Philips P">P Philips</name>
</author>
<author>
<name sortKey="De Bona, F" uniqKey="De Bona F">F De Bona</name>
</author>
<author>
<name sortKey="Hartmann, L" uniqKey="Hartmann L">L Hartmann</name>
</author>
<author>
<name sortKey="Bohlen, A" uniqKey="Bohlen A">A Bohlen</name>
</author>
<author>
<name sortKey="Kruger, N" uniqKey="Kruger N">N Krüger</name>
</author>
<author>
<name sortKey="Sonnenburg, S" uniqKey="Sonnenburg S">S Sonnenburg</name>
</author>
<author>
<name sortKey="R Tsch, G" uniqKey="R Tsch G">G Rätsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saeys, Y" uniqKey="Saeys Y">Y Saeys</name>
</author>
<author>
<name sortKey="Abeel, T" uniqKey="Abeel T">T Abeel</name>
</author>
<author>
<name sortKey="Degroeve, S" uniqKey="Degroeve S">S Degroeve</name>
</author>
<author>
<name sortKey="Van, Y De Peer" uniqKey="Van Y">Y Van de Peer</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Syst Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Syst Biol</journal-id>
<journal-title-group>
<journal-title>BMC Systems Biology</journal-title>
</journal-title-group>
<issn pub-type="epub">1752-0509</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25605483</article-id>
<article-id pub-id-type="pmc">4305984</article-id>
<article-id pub-id-type="publisher-id">1752-0509-8-S5-S5</article-id>
<article-id pub-id-type="doi">10.1186/1752-0509-8-S5-S5</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Wang</surname>
<given-names>Xiaolei</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Kuwahara</surname>
<given-names>Hiroyuki</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A3">
<name>
<surname>Gao</surname>
<given-names>Xin</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>xin.gao@kaust.edu.sa</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</aff>
<aff id="I2">
<label>2</label>
Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), 23955 Thuwal, Kingdom of Saudi Arabia</aff>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>12</day>
<month>12</month>
<year>2014</year>
</pub-date>
<volume>8</volume>
<issue>Suppl 5</issue>
<supplement>
<named-content content-type="supplement-title">Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology</named-content>
<named-content content-type="supplement-editor">Tetsuo Shibuya and Chuan Yi Tang</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. Articles are based on presentations made at The 25th International Conference on Genome Informatics (GIW/ISCB-Asia). The peer review process was overseen by the Supplement Editors in accordance with BioMed Central's peer review guidelines for supplements. The Supplement Editors declare they have no competing interests.</named-content>
</supplement>
<fpage>S5</fpage>
<lpage>S5</lpage>
<permissions>
<copyright-statement>Copyright © 2014 Wang et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Wang et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1752-0509/8/S5/S5"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology.</p>
</sec>
<sec>
<title>Results</title>
<p>Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in
<italic>Saccharomyces cerevisiae </italic>
using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer that had not been reported before. Further tests on four other TFs in
<italic>S. cerevisiae </italic>
demonstrated the generality of our method.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>We proposed in this paper a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.</p>
</sec>
</abstract>
<kwd-group>
<kwd>binding affinity</kwd>
<kwd>protein-DNA interaction</kwd>
<kwd>support vector regression</kwd>
<kwd>weighted degree kernel</kwd>
</kwd-group>
<conference>
<conf-date>15-17 December 2014</conf-date>
<conf-name>The 25th International Conference on Genome Informatics (GIW/ISCB-Asia)</conf-name>
<conf-loc>Tokyo, Japan</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>A major goal of synthetic biology is to manipulate existing organisms so as to construct new biological systems that possess desired functions [
<xref ref-type="bibr" rid="B1">1</xref>
-
<xref ref-type="bibr" rid="B3">3</xref>
]. The ability to adjust the expression of genes precisely is then necessary if the behavior of a synthetic biological system is to be fine-tuned for a given functional specification. Since the initiation of transcription is one of the most important steps in gene regulation [
<xref ref-type="bibr" rid="B4">4</xref>
], a quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to predicting the dynamics of gene circuits. However, the mechanistic characterization of intricate TF-DNA interactions from first principles of biochemistry still remains elusive. Consequently, the use of phenomenological models to characterize the affinity of TF-DNA interactions is essential for the rational design of synthetic gene circuits.</p>
<p>There are several high-throughput methods available to perform high-resolution measurements of protein-DNA interactions. These include
<italic>protein-binding microarrays </italic>
(PBMs), which characterize
<italic>in vitro </italic>
binding specificities of TFs for relatively short DNA sequences [
<xref ref-type="bibr" rid="B5">5</xref>
], and
<italic>chromatin immunoprecipitation </italic>
(ChIP)-based methods, which, in a cell-type specific fashion, can map genome-wide binding locations of TFs, provided that the relevant antibodies are available [
<xref ref-type="bibr" rid="B6">6</xref>
].
<italic>Mechanically induced trapping of molecular interactions </italic>
(MITOMI) was developed to detect low-affinity TF-DNA binding using a microfluidic device [
<xref ref-type="bibr" rid="B7">7</xref>
]. A second generation of MITOMI, capable of measuring thousands of interactions in parallel, was subsequently developed [
<xref ref-type="bibr" rid="B8">8</xref>
]. Recently,
<italic>high-throughput sequencing - fluorescent ligand interaction profiling </italic>
(HiTS-FLIP) has been developed [
<xref ref-type="bibr" rid="B9">9</xref>
]. This is a method based on a second-generation DNA sequencing technology, which allows for hundreds of millions of
<italic>in vitro </italic>
measurements of TF-DNA binding affinities and provides a more comprehensive picture of the binding affinity landscapes of TFs. Further, HiTS-FLIP permits measurements of longer sequences of DNA, making it possible to analyze complex binding affinity landscapes of dimeric and oligomeric TFs.</p>
<p>Various statistical and computational models have been developed to characterize binding affinity [
<xref ref-type="bibr" rid="B10">10</xref>
,
<xref ref-type="bibr" rid="B11">11</xref>
]. The most commonly used one of these is the position weight matrix (PWM) [
<xref ref-type="bibr" rid="B12">12</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
]. The basic PWM model aligns the DNA binding sequences and calculates the weights of different nucleotides at different positions within the alignment. There are different variants of PWM. All PWM models, however, assume that mononucleotides at different positions contribute independently. Although such models provided relatively accurate predictions for short binding motifs [
<xref ref-type="bibr" rid="B14">14</xref>
], alternative models have been developed to encode short-range information through building much larger matrices for subsequences (
<italic>k</italic>
-mers) [
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B15">15</xref>
]. MDscan [
<xref ref-type="bibr" rid="B16">16</xref>
], for example, combined word enumeration with position-specific weight matrix updating to iteratively approximate a maximum a posteriori scoring function. Foat et al. proposed MatrixREDUCE [
<xref ref-type="bibr" rid="B17">17</xref>
], a statistical mechanics method that took high-throughput measurements of binding affinity as inputs and performed a least-squares fit to estimate the position specific affinity matrix that contained the relative energy contribution of each nucleotide at different positions. RankMotif++ [
<xref ref-type="bibr" rid="B18">18</xref>
] learned PWM motif models by maximum likelihood estimation of a probabilistic model for binding preferences.</p>
<p>Recent advances in high-throughput measurements of binding affinity and machine learning techniques have enabled the direct learning of the DNA-binding affinity landscape of TFs. A special case of the mismatch string kernel, the di-mismatch kernel, has been proposed [
<xref ref-type="bibr" rid="B19">19</xref>
] that maps each binding sequence to a kernel space depending on similarity to all unique
<italic>k</italic>
-mers in the training data, for a fixed
<italic>k </italic>
and a certain allowance of mismatches. The spectrum kernel has been applied to classify mammalian enhancers [
<xref ref-type="bibr" rid="B20">20</xref>
]. More recently, Annala et al. [
<xref ref-type="bibr" rid="B21">21</xref>
] proposed a linear model (HK
<italic></italic>
ME) that assumes the binding affinity to be the sum of the contributions of certain subsequences of the binding sequences. Their method was the best performer in the Dialogue for Reverse Engineering Assessment and Methods 5 (DREAM5) transcription factor/DNA motif recognition competition.</p>
<p>Despite the significant advances made in computational methods for modeling DNA-binding affinity landscapes, bottlenecks continue to hamper progress. The major one is that, although most existing methods assume that
<italic>k</italic>
-mers make an important contribution to the binding affinity, there is no systematic overview provided of the importance of all
<italic>k</italic>
-mers with different lengths at different positions. For example, the PWM model assumes mononucleotides to be independent [
<xref ref-type="bibr" rid="B13">13</xref>
], whereas the di-mismatch kernel reveals which of the
<italic>k</italic>
-mers are important for a specific length, but cannot determine at which positions the
<italic>k</italic>
-mers are important or whether those with shorter length also contribute to the affinity. Similarly, the HK
<italic></italic>
ME method used
<italic>k</italic>
-mers of specific lengths, i.e., all 4- to 6-mers as well as those 7- and 8-mers with the highest median intensity. These restrictions make it difficult for existing methods to capture the important
<italic>k</italic>
-mers and their important positions.</p>
<p>In this paper, we propose a two-round support vector regression (SVR) method based on weighted degree (WD) kernels to overcome this bottleneck. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The identified subsequences are then fed into a second WD kernel to fit the experimentally measured affinities. Our method can systematically explore all the subsequences up to a certain length at all positions, and the results can easily be interpreted by users. We have applied this method to predict the binding affinity landscape of Gcn4p in
<italic>Saccharomyces cerevisiae </italic>
by using datasets from HiTS-FLIP. Through comparison with state-of-the-art predictors, we demonstrate that our method can provide significant improvements. We also demonstrate that our method can be straightforwardly used to visualize the importance of any
<italic>k</italic>
-mer at any position in binding sequences, thereby providing insight into the design of binding site sequences. Furthermore, we predict two high-affinity 10-mer motifs that are significantly more stable than the previously reported binding motifs. To evaluate the generalization power of our method, we further test it on the datasets from MITOMI2.0 [
<xref ref-type="bibr" rid="B8">8</xref>
] of four other TFs in
<italic>S. cerevisiae</italic>
, i.e., Cbf1p, Cin5p, Pho4p and Yap1p. Our method shows consistent improvements over state-of-the-art methods.</p>
</sec>
<sec sec-type="materials|methods">
<title>Materials and methods</title>
<sec>
<title>Support vector regression</title>
<p>Support vector regression (SVR) is a supervised regression model [
<xref ref-type="bibr" rid="B22">22</xref>
]. Given training data {(
<italic>x</italic>
<sub>1</sub>
<italic>, y</italic>
<sub>1</sub>
)
<italic>, . . . </italic>
, (
<italic>x
<sub>n</sub>
, y
<sub>n</sub>
</italic>
)
<italic>}</italic>
, where
<italic>x
<sub>i</sub>
</italic>
is a high-dimensional feature vector and <italic>y</italic>
<sub>i </sub>
is the corresponding real-valued variable, the goal is to learn a function
<italic>f </italic>
(
<italic>x</italic>
) that has at most ε deviation from
<italic>y</italic>
<sub>i </sub>
for all the
<italic>x
<sub>i</sub>
</italic>
. In the case of linear SVR, we have</p>
<p>
<disp-formula>
<mml:math id="M1" name="1752-0509-8-S5-S5-i15" overflow="scroll">
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>〈</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>〉</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where 〈·,·〉 denotes the inner product and
<italic>w </italic>
denotes the coefficient vector trained in linear SVR. The optimization formulation of SVR is thus</p>
<p>
<disp-formula>
<mml:math id="M2" name="1752-0509-8-S5-S5-i1" overflow="scroll">
<mml:mtext class="textsf" mathvariant="sans-serif">min</mml:mtext>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-rel">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>C</mml:mi>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo class="MathClass-op">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-bin">*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mstyle>
<mml:mo>)</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M3" name="1752-0509-8-S5-S5-i2" overflow="scroll">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">subject to</mml:mtext>
</mml:mstyle>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mfenced open="{">
<mml:mrow>
<mml:mtable class="array" columnlines="none" equalcolumns="false" equalrows="false">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-open">〈</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">〉</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo class="MathClass-rel">≤</mml:mo>
<mml:mi>ε</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo class="MathClass-open">〈</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">〉</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">≤</mml:mo>
<mml:mi>ε</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-bin">*</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:msub>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-bin">*</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo class="MathClass-rel">≥</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where ξ
<sub>i </sub>
and
<inline-formula>
<mml:math id="M4" name="1752-0509-8-S5-S5-i3" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>ξ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-bin">*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
are slack variables and
<italic>C ></italic>
0 is the trade-off constant. If the relationship between
<italic>x
<sub>i </sub>
</italic>
and
<italic>y
<sub>i </sub>
</italic>
is non-linear, SVR can perform non-linear regression by means of the kernel trick, which implicitly maps
<italic>x
<sub>i </sub>
</italic>
to higher-dimensional feature spaces, i.e.,
<italic>f </italic>
(
<italic>x</italic>
) = 〈
<italic>w</italic>
, Φ(
<italic>x</italic>
)〉 +
<italic>b</italic>
, where Φ(
<italic>x</italic>
) is a kernel mapping representation.</p>
</sec>
<sec>
<title>String kernels</title>
<p>String kernels are positive definite kernel functions defined on pairs of strings. The basic idea of string kernels is to map each string to a high-dimensional feature space and calculate the inner product of the two feature vectors. In other words, string kernels measure the similarity between pairs of strings. The more similar the two strings,
<bold>a </bold>
and
<bold>b</bold>
, the higher will be the value of the string kernel,
<italic>K</italic>
(
<bold>a</bold>
,
<bold>b</bold>
).</p>
<p>The two main types of string kernels are distribution-based kernels and
<italic>k </italic>
-mer-based kernels. Distribution-based kernels attempt to model uncertainties using random variables. Such kernels include the probability product kernel [
<xref ref-type="bibr" rid="B23">23</xref>
] and the spectral latent kernel [
<xref ref-type="bibr" rid="B24">24</xref>
]. Such kernels, however, require relatively long input sequences to capture the statistically meaningful distributions of subsequences.</p>
<p>In contrast to the distribution-based kernels, the
<italic>k </italic>
-mer-based string kernels essentially count all subsequences in the two sequences with lengths up to a pre-defined value and use these as features. A
<italic>k</italic>
-mer is a length-
<italic>k </italic>
subsequence in a sequence,
<bold>a</bold>
. There are several types of
<italic>k</italic>
-mer-based kernels, each of which makes different assumptions. The spectrum kernel [
<xref ref-type="bibr" rid="B25">25</xref>
] maps each sequence into a feature space where each dimension counts the number of occurrences of a particular subsequence. The underlying assumption of the spectrum kernel is that the positions at which the subsequences occur are not important; rather, the frequencies of their occurrences are the informative factor. Unlike the spectrum kernel, which is position independent, the weighted degree (WD) kernel [
<xref ref-type="bibr" rid="B26">26</xref>
] compares matches of subsequences at exact positions.</p>
<p>Specifically, let
<bold>a</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
) denote a
<italic>k</italic>
-mer starting at position
<italic>i </italic>
of
<bold>a</bold>
. A
<italic>d</italic>
-th degree WD kernel of two sequences,
<bold>a </bold>
and
<bold>b</bold>
, of length
<italic>L </italic>
is defined as</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M5" name="1752-0509-8-S5-S5-i4" overflow="scroll">
<mml:mi>K</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo class="MathClass-op">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo class="MathClass-op">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>β
<sub>k </sub>
</italic>
are weights for different
<italic>k</italic>
-mers and
<inline-formula>
<mml:math id="M6" name="1752-0509-8-S5-S5-i12" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-bin">·</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
is an indicator function that equals 1 when the condition inside the brackets is true and 0 otherwise. From Eq. 1, the computational complexity of calculating the WD kernel between two sequences
<bold>a </bold>
and
<bold>b </bold>
is
<italic>O</italic>
(
<italic>dL</italic>
).</p>
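<p>Eq. 1 can be transcribed almost literally into code. The sketch below is a minimal illustration, assuming that the weights β are supplied by the caller as a list of length d (the text does not fix their values) and that both sequences have the same length L. The name <monospace>wd_kernel</monospace> is hypothetical. This naive version compares every k-mer directly and does not attempt to reach the O(dL) bound.</p>
<preformat>
```python
def wd_kernel(a, b, d, beta):
    """Weighted degree kernel of Eq. 1 for two strings of equal length.

    beta[k - 1] is the weight for k-mers, k = 1..d. Positions are
    0-based here, whereas the text counts from 1.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if len(beta) != d:
        raise ValueError("beta must hold one weight per k-mer length")
    L = len(a)
    total = 0.0
    for k in range(1, d + 1):
        # Count positions where the two k-mers agree exactly.
        matches = sum(
            1 for i in range(L - k + 1) if a[i:i + k] == b[i:i + k]
        )
        total += beta[k - 1] * matches
    return total
```
</preformat>
<p>With unit weights and d = 2, the sequences ACGT and ACGA share three matching 1-mers and two matching 2-mers at identical positions.</p>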
<p>To account for alterations in DNA sequences caused by substitutions, deletions, and insertions, the WD kernel with shifts and mismatches was proposed [
<xref ref-type="bibr" rid="B27">27</xref>
-
<xref ref-type="bibr" rid="B29">29</xref>
] as follows:
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M7" name="1752-0509-8-S5-S5-i5" overflow="scroll">
<mml:mrow>
<mml:mtable class="gathered">
<mml:mtr>
<mml:mtd>
<mml:mi>K</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>γ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">&lt;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>ω</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">a</mml:mtext>
</mml:mstyle>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">b</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>μ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">a</mml:mtext>
</mml:mstyle>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">b</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">a</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">b</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">a</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">b</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd></mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>β
<sub>k,m </sub>
</italic>
are the weights for
<italic>k</italic>
-mers and
<italic>m </italic>
mismatches,
<italic>γ
<sub>i </sub>
</italic>
are the weights for different sequence positions,
<inline-formula>
<mml:math id="M8" name="1752-0509-8-S5-S5-i6" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>ω</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:math>
</inline-formula>
are the weights assigned to shifts (in either direction) of extent
<italic>s</italic>
, and
<italic>S</italic>
(
<italic>i</italic>
) determines the shift range at position
<italic>i</italic>
.
<inline-formula>
<mml:math id="M9" name="1752-0509-8-S5-S5-i13" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="double-struck">I</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
[
<bold>a</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i </italic>
+
<italic>s</italic>
) =
<italic>
<sub>m </sub>
</italic>
<bold>b</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
)] equals 1 if and only if
<bold>a</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i </italic>
+
<italic>s</italic>
) and
<bold>b</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
) differ by exactly
<italic>m </italic>
mismatches, and 0 otherwise. In this case, the computational complexity of calculating the WD kernel between two sequences
<bold>a </bold>
and
<bold>b </bold>
is
<italic>O</italic>
(
<italic>dLs</italic>
). Thus, the total runtime to compute the kernel matrix for the entire training dataset is
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
<italic>dLs</italic>
).</p>
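<p>The following sketch transcribes Eq. 2 term by term. Several simplifications are assumptions of this illustration rather than choices stated in the text: the shift range S(i) is taken to be a constant S, a shift is admitted only if the shifted k-mer still lies inside the sequence, and β and γ default to 1 unless the caller passes functions for them. The names <monospace>hamming</monospace> and <monospace>wd_shift_mismatch_kernel</monospace> are hypothetical. Because both indicator terms coincide at s = 0 and ω is then 1/2, the kernel reduces to Eq. 1 when S = 0 and M = 0.</p>
<preformat>
```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for p, q in zip(u, v) if p != q)


def wd_shift_mismatch_kernel(a, b, d, M, S, beta=None, gamma=None):
    """Naive evaluation of the WD kernel with shifts and mismatches (Eq. 2).

    beta(k, m) and gamma(i) are optional weight functions (default 1.0).
    Positions are 0-based. No attempt is made to be efficient.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    L = len(a)
    total = 0.0
    for k in range(1, d + 1):
        for m in range(M + 1):
            w_km = 1.0 if beta is None else beta(k, m)
            for i in range(L - k + 1):
                g_i = 1.0 if gamma is None else gamma(i)
                # Largest shift that keeps the shifted k-mer in bounds.
                max_shift = min(S, L - k - i)
                for s in range(max_shift + 1):
                    omega = 1.0 / (2.0 * (s + 1))
                    # mu: shift applied to a, plus shift applied to b.
                    mu = int(hamming(a[i + s:i + s + k], b[i:i + k]) == m)
                    mu += int(hamming(a[i:i + k], b[i + s:i + s + k]) == m)
                    total += w_km * g_i * omega * mu
    return total
```
</preformat>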
</sec>
<sec>
<title>A two-round SVR with WD kernel method</title>
<p>The workflow of the proposed two-round method is shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
. The main idea is to use support vector regression with weighted degree kernels for both feature selection and regression.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>The workflow of the proposed two-round support vector regression method with weighted degree kernels</bold>
. (a) The input training DNA binding site sequences with their corresponding
<italic>K
<sub>d </sub>
</italic>
values, demonstrating the general form of the inputs. (b) The weighted degree kernel matrix of the first round, calculated from Eq. 2. Each axis lists the training binding sequences shown in (a), and each entry gives the WD-kernel similarity between the two corresponding sequences. (c) Based on the kernel matrix in (b), the first round of support vector regression selects the top ten
<italic>k</italic>
-mers that contribute most to the high binding affinity (in blue) and the ten
<italic>k</italic>
-mers that contribute the most to the low binding affinity (in red). The locally optimal parameters are also selected in this step. (d) The Round 2 regression predicts binding affinities by using the selected
<italic>k</italic>
-mers in a new WD kernel.</p>
</caption>
<graphic xlink:href="1752-0509-8-S5-S5-1"></graphic>
</fig>
<p>In the first round, we map the input training sequences into a kernel space by the WD kernel with shift limit,
<italic>s</italic>
, and mismatch limit,
<italic>m</italic>
, according to Eq. 2. Here, all the
<italic>k</italic>
-mers up to length
<italic>d </italic>
are used. We then apply WD-kernel-based SVR to learn a model that maps DNA sequences to their
<italic>K
<sub>d </sub>
</italic>
values. The settings of
<italic>d, s </italic>
and
<italic>m </italic>
can be determined by cross-validation (CV) on the training set. From the learned SVR model, the top ten
<italic>k</italic>
-mers that contribute most to the high binding affinity (small
<italic>K
<sub>d </sub>
</italic>
values) and another ten that contribute the most to the low binding affinity (large
<italic>K
<sub>d </sub>
</italic>
values) for each
<italic>k </italic>
up to
<italic>d </italic>
are selected according to their expected decrease and increase of
<italic>f </italic>
(
<bold>x</bold>
), respectively, where
<italic>f </italic>
(
<italic>·</italic>
) is the learned regression model and
<bold>x </bold>
is an input sequence of length
<italic>L</italic>
. Here ten is the default parameter of our method and can be customized by users. Following [
<xref ref-type="bibr" rid="B30">30</xref>
], we quantify the importance of
<bold>x</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
), which represents a
<italic>k</italic>
-mer starting from position
<italic>i </italic>
of
<bold>x</bold>
, as
<italic>Q</italic>
(
<bold>x</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
)) =
<inline-formula>
<mml:math id="M10" name="1752-0509-8-S5-S5-i14" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
[
<italic>f </italic>
(
<bold>x</bold>
)|
<bold>x</bold>
<italic>
<sub>k </sub>
</italic>
(
<italic>i</italic>
)]
−
<inline-formula>
<mml:math id="M11" name="1752-0509-8-S5-S5-i14" overflow="scroll">
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
[
<italic>f </italic>
(
<bold>x</bold>
)]. In the second round, we learn another SVR model by encoding only the selected
<italic>k</italic>
-mers from Round 1 in the WD kernel. That is, when checking
<italic>k</italic>
-mers in the kernel function, we only count those matches that belong to the selected
<italic>k</italic>
-mers from the first round. In this round, since the important
<italic>k</italic>
-mers are supposed to be more conserved than other subsequences, we allow only shifts up to
<italic>s </italic>
but not mismatches, and, thus, follow Eq. 2. Again, the parameters
<italic>d </italic>
and
<italic>s </italic>
are set by cross-validation on the training set.</p>
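<p>The importance score Q described above can be illustrated with a minimal sketch. It assumes that both expectations are replaced by sample means over a set of sequences with known model outputs, which is a simplification made here for illustration and not necessarily how the score is computed in [30]. The function names and the toy data are hypothetical.</p>

```python
from collections import defaultdict


def empirical_importance(sequences, predictions, k):
    """Estimate Q(k-mer, position) = E[f(x) | k-mer at position] - E[f(x)].

    Assumption of this sketch: expectations are approximated by sample
    means over the supplied sequences and their model outputs f(x).
    """
    overall_mean = sum(predictions) / len(predictions)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, f_x in zip(sequences, predictions):
        for i in range(len(seq) - k + 1):
            key = (seq[i:i + k], i)  # k-mer together with its start position
            totals[key] += f_x
            counts[key] += 1
    return {key: totals[key] / counts[key] - overall_mean for key in totals}


def select_top(importance, n_top=10):
    """Return the n_top entries that lower f most and the n_top that raise it most."""
    ranked = sorted(importance.items(), key=lambda item: item[1])
    lowers_f = ranked[:n_top]          # expected decrease of f: high affinity
    raises_f = ranked[-n_top:][::-1]   # expected increase of f: low affinity
    return lowers_f, raises_f


# Toy data (hypothetical): four 4-mers with made-up model outputs.
seqs = ["ACGT", "ACGA", "TCGT", "TTTT"]
f_values = [10.0, 20.0, 30.0, 100.0]
q = empirical_importance(seqs, f_values, k=2)
high_affinity, low_affinity = select_top(q, n_top=1)
```

<p>A negative score marks a k-mer whose presence at that position is associated with a smaller predicted value, that is, with higher affinity when f predicts K_d.</p>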
<p>Once a test DNA binding sequence is given, it is mapped to a kernel space that is composed of the selected
<italic>k</italic>
-mers from Round 1. The learned SVR model in Round 2 is then applied to predict the
<italic>K
<sub>d </sub>
</italic>
value for this test binding sequence.</p>
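<p>The restriction applied in Round 2 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it omits shifts and mismatches, it assumes the standard WD weights beta_k = 2(d - k + 1) / (d(d + 1)), which may differ from Eq. 2, and it assumes that the selected k-mers are stored as (k-mer, position) pairs.</p>

```python
def wd_kernel(x, y, d, selected=None):
    """Weighted degree kernel without shifts or mismatches.

    Assumptions of this sketch: standard weights
    beta_k = 2 * (d - k + 1) / (d * (d + 1)); if `selected` is given,
    only matches whose (k-mer, position) pair is in that set are counted,
    mimicking the Round 2 restriction.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for k in range(1, d + 1):
        beta = 2.0 * (d - k + 1) / (d * (d + 1))
        for i in range(len(x) - k + 1):
            kmer = x[i:i + k]
            if kmer != y[i:i + k]:
                continue  # no match at this position
            if selected is not None and (kmer, i) not in selected:
                continue  # match exists but the k-mer was not selected
            total += beta
    return total


# Round 1 style: every matching k-mer up to length d contributes.
full_value = wd_kernel("ACGT", "ACGA", d=2)
# Round 2 style: only the (hypothetical) selected k-mer "AC" at position 0 counts.
restricted_value = wd_kernel("ACGT", "ACGA", d=2, selected={("AC", 0)})
```

<p>The resulting kernel values would then be passed to any SVR solver that accepts a precomputed kernel matrix.</p>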
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>Datasets</title>
<p>We first applied our method to predict the binding affinity landscape of Gcn4p in
<italic>Saccharomyces cerevisiae </italic>
by using datasets from HiTS-FLIP. Gcn4p is a master regulator that transcriptionally controls the expression of many genes including those for the amino acid biosynthesis pathway [
<xref ref-type="bibr" rid="B31">31</xref>
]. Gcn4p is a basic leucine zipper protein which interacts with DNA binding sites as a dimer [
<xref ref-type="bibr" rid="B32">32</xref>
], and is known to preferentially bind to several sequence motifs, including a 7-mer motif, TGACTCA [
<xref ref-type="bibr" rid="B33">33</xref>
,
<xref ref-type="bibr" rid="B34">34</xref>
]. On the basis of its important role in yeast in general control of amino acids [
<xref ref-type="bibr" rid="B35">35</xref>
], a quantitative characterization of the binding of Gcn4p to its promoter sites is crucial not only for elucidation of the regulatory mechanisms involved in such stress-response pathways, but also to allow design of a synthetic GCN4-induced response pathway in yeast.</p>
<p>The HiTS-FLIP datasets contain
<italic>K
<sub>d </sub>
</italic>
values of 83,252 DNA sequences of length 12bp. In TF-DNA interactions,
<italic>K
<sub>d </sub>
</italic>
represents the concentration of the TF at which the DNA region is occupied 50% of the time at equilibrium. The
<italic>K
<sub>d </sub>
</italic>
values in the HiTS-FLIP datasets range from 8 nM to 1000 nM, where a small
<italic>K
<sub>d </sub>
</italic>
represents high binding affinity and a large
<italic>K
<sub>d </sub>
</italic>
represents low binding affinity. Since the adjustment of binding-affinity for optimal TF-recognition sites is often important in the fine-tuning of the behavior of gene circuits, we focused on modeling the DNA-binding affinity landscape of Gcn4p with strong interactions. Here, DNA sequences with
<italic>K
<sub>d </sub>
</italic>
less than 100 nM were defined as optimal Gcn4p recognition sites. This threshold was chosen by looking up in the HiTS-FLIP datasets the
<italic>K
<sub>d </sub>
</italic>
value of a specific 9-mer, to which Gcn4p was reported to bind relatively poorly
<italic>in vivo </italic>
and
<italic>in vitro </italic>
[
<xref ref-type="bibr" rid="B34">34</xref>
]. This resulted in 1,393 DNA sequences. This 12-mer dataset was randomly partitioned into 10 subsets to perform 10-fold CV. All the results of our method and other methods in this paper are based on the same 10-fold CV.</p>
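<p>The data preparation described above can be sketched in a few lines. The function names, the dictionary layout and the random seed are hypothetical; only the 100 nM threshold, the 1,393 sequences and the 10 folds come from the text.</p>

```python
import random


def select_strong_binders(kd_by_sequence, threshold_nm=100.0):
    """Keep the sequences whose K_d (in nM) lies below the threshold."""
    return {s: kd for s, kd in kd_by_sequence.items() if threshold_nm > kd}


def k_fold_indices(n_items, n_folds=10, seed=0):
    """Randomly partition range(n_items) into n_folds nearly equal subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary choice
    return [indices[f::n_folds] for f in range(n_folds)]


# Toy K_d values (hypothetical) to show the thresholding step.
toy = {"AAAAAAAAAAAA": 950.0, "ATGACTCATAAA": 12.5}
strong = select_strong_binders(toy)

# Partition of the 1,393 retained sequences into 10 folds.
folds = k_fold_indices(1393)
```

<p>Each fold serves once as the test set while the remaining nine form the training set.</p>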
<p>To evaluate the generalization power of our method, we further tested it on four other TF data sets of
<italic>S. cerevisiae</italic>
, i.e., Cbf1p, Cin5p, Pho4p and Yap1p. These four TFs are the ones measured by MITOMI2.0 [
<xref ref-type="bibr" rid="B8">8</xref>
]. Each data set contains the relative binding affinities for nucleotide sequences with 52bp in length, in which shorter binding sites are included. After removing the sequences with "nan" (not a number) and taking average relative affinities for the same sequences, each data set contains 1,456 52bp sequences, with their corresponding relative binding affinities.</p>
</sec>
<sec>
<title>Performance measures</title>
<p>To evaluate the performance of regression methods, we measured the root mean square error (RMSE), root mean square relative error (RMSRE), Pearson product-moment correlation coefficient (Pearson Cor) and Spearman's rank correlation coefficient (Spearman Cor). These measures are defined as follows:</p>
<p>
<disp-formula>
<mml:math id="M12" name="1752-0509-8-S5-S5-i7" overflow="scroll">
<mml:mrow>
<mml:mtext>RMSE</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mo>∑</mml:mo>
<mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mspace width="1em"></mml:mspace>
<mml:mtext>RMSRE</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mo>∑</mml:mo>
<mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub>
</mml:mrow>
<mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub>
</mml:mfrac>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M13" name="1752-0509-8-S5-S5-i8" overflow="scroll">
<mml:mrow>
<mml:mtext>Pearson Cor</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msup><mml:mover accent="false" class="mml-overline"><mml:mi>y</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover><mml:mo>′</mml:mo></mml:msup>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mover accent="false" class="mml-overline"><mml:mi>y</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msup><mml:mover accent="false" class="mml-overline"><mml:mi>y</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover><mml:mo>′</mml:mo></mml:msup>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:msqrt>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mover accent="false" class="mml-overline"><mml:mi>y</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M14" name="1752-0509-8-S5-S5-i9" overflow="scroll">
<mml:mrow>
<mml:mtext>Spearman Cor</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msup><mml:mover accent="false" class="mml-overline"><mml:mi>z</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover><mml:mo>′</mml:mo></mml:msup>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mover accent="false" class="mml-overline"><mml:mi>z</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msubsup><mml:mi>z</mml:mi><mml:mi>i</mml:mi><mml:mo>′</mml:mo></mml:msubsup>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:msup><mml:mover accent="false" class="mml-overline"><mml:mi>z</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover><mml:mo>′</mml:mo></mml:msup>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:msqrt>
<mml:mrow>
<mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo class="MathClass-rel">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mover accent="false" class="mml-overline"><mml:mi>z</mml:mi><mml:mo accent="true">¯</mml:mo></mml:mover>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>y
<sub>i </sub>
</italic>
and
<inline-formula>
<mml:math id="M15" name="1752-0509-8-S5-S5-i10" overflow="scroll">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>
are the real and predicted
<italic>K
<sub>d </sub>
</italic>
values,
<italic>z
<sub>i </sub>
</italic>
and
<inline-formula>
<mml:math id="M16" name="1752-0509-8-S5-S5-i11" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>′</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
are the real and predicted rank of
<italic>K
<sub>d </sub>
</italic>
values, for the
<italic>i</italic>
-th binding sequence respectively, and
<italic>n </italic>
is the number of binding sequences in the training or test set. An overbar denotes the mean over these sequences.</p>
</sec>
<sec>
<title>Results of 10-fold cross validation</title>
<p>For selection of parameters (i.e., degree '
<italic>d</italic>
', shift '
<italic>s</italic>
' and mismatch '
<italic>m</italic>
'), a grid search for 2
<italic>≤ d ≤ </italic>
9, 0
<italic>≤ s ≤ </italic>
7, and 0
<italic>≤ m ≤ min</italic>
(
<italic>d</italic>
, 3) on the training sets was conducted on both rounds of our method. When
<italic>d </italic>
increases above 7, there is no significant improvement in the performance, but the running time increases due to the much larger number of
<italic>k</italic>
-mers (Figure
<xref ref-type="fig" rid="F2">2</xref>
and Table
<xref ref-type="table" rid="T1">1</xref>
). Therefore, we chose the best parameter setting in terms of Pearson correlation coefficient for Round 1 as
<italic>d </italic>
= 7,
<italic>s </italic>
= 1, and
<italic>m </italic>
= 1 (Figure
<xref ref-type="fig" rid="F2">2(a)</xref>
). Important
<italic>k</italic>
-mers were thus identified and used as the WD kernel coding subsequences for Round 2. The same grid search was conducted on the training sets of the same 10-fold CV. The best shift and mismatch parameters are expected to be small for Round 2 because Round 1 already identified important positions of the selected
<italic>k</italic>
-mers. This is validated by the best parameter setting of
<italic>d </italic>
= 7,
<italic>s </italic>
= 0, and
<italic>m </italic>
= 0 (Figure
<xref ref-type="fig" rid="F2">2(b)</xref>
). By fixing
<italic>s </italic>
and
<italic>m </italic>
of the test sets to be the best ones identified on the training sets and grid searching for
<italic>d</italic>
, the performance on the test sets was consistent with that on the training sets. Both Round 1 and Round 2 had the best performance on the test sets when
<italic>d </italic>
= 7 (Table
<xref ref-type="table" rid="T1">1</xref>
).</p>
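<p>The search space described above can be enumerated as in the sketch below. The scoring function passed to best_setting is a hypothetical placeholder for the average cross-validated Pearson correlation; it is not part of the original method description.</p>

```python
def parameter_grid(d_values=range(2, 10), s_values=range(0, 8), max_mismatch=3):
    """Enumerate (d, s, m) with d in 2..9, s in 0..7 and m in 0..min(d, 3)."""
    grid = []
    for d in d_values:
        for s in s_values:
            for m in range(0, min(d, max_mismatch) + 1):
                grid.append((d, s, m))
    return grid


def best_setting(grid, cv_score):
    """Return the setting with the highest score; cv_score is user supplied."""
    return max(grid, key=cv_score)


grid = parameter_grid()


# Placeholder score (hypothetical) that peaks at d = 7, s = 1, m = 1.
def toy_score(setting):
    d, s, m = setting
    return -abs(d - 7) - abs(s - 1) - abs(m - 1)


chosen = best_setting(grid, toy_score)
```

<p>With these ranges the grid contains 248 settings, each of which requires one full cross-validation run per round.</p>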
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Average prediction performance of the Rounds 1 and 2 of our method on test sets of the 10-fold CV.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" colspan="5">Test Performance of Round 1: WD with s = 1 and m = 1</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">d</td>
<td align="center">
<bold>Runtime</bold>
</td>
<td align="center">
<bold>RMSE</bold>
</td>
<td align="center">
<bold>Pearson Cor</bold>
</td>
<td align="center">
<bold>Spearman Cor</bold>
</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">572</td>
<td align="center">20.06</td>
<td align="center">0.74</td>
<td align="center">0.46</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">1034</td>
<td align="center">19.99</td>
<td align="center">0.74</td>
<td align="center">0.47</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">1448</td>
<td align="center">19.87</td>
<td align="center">0.75</td>
<td align="center">0.48</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">1834</td>
<td align="center">19.79</td>
<td align="center">0.75</td>
<td align="center">0.49</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">2221</td>
<td align="center">19.77</td>
<td align="center">0.75</td>
<td align="center">0.49</td>
</tr>
<tr>
<td align="center">
<bold>7</bold>
</td>
<td align="center">
<bold>2430</bold>
</td>
<td align="center">
<bold>19.76</bold>
</td>
<td align="center">
<bold>0.75</bold>
</td>
<td align="center">
<bold>0.50</bold>
</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">2908</td>
<td align="center">19.75</td>
<td align="center">0.75</td>
<td align="center">0.50</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">3193</td>
<td align="center">19.74</td>
<td align="center">0.75</td>
<td align="center">0.50</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="center" colspan="5">
<bold>Test Performance of Round 2: WD with s = 0 and m = 0</bold>
</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">d</td>
<td align="center">
<bold>Runtime</bold>
</td>
<td align="center">
<bold>RMSE</bold>
</td>
<td align="center">
<bold>Pearson Cor</bold>
</td>
<td align="center">
<bold>Spearman Cor</bold>
</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">47</td>
<td align="center">18.82</td>
<td align="center">0.78</td>
<td align="center">0.55</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">90</td>
<td align="center">18.09</td>
<td align="center">0.80</td>
<td align="center">0.59</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">128</td>
<td align="center">17.65</td>
<td align="center">0.81</td>
<td align="center">0.62</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">166</td>
<td align="center">17.34</td>
<td align="center">0.82</td>
<td align="center">0.65</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">200</td>
<td align="center">17.09</td>
<td align="center">0.83</td>
<td align="center">0.66</td>
</tr>
<tr>
<td align="center">
<bold>7</bold>
</td>
<td align="center">
<bold>235</bold>
</td>
<td align="center">
<bold>16.89</bold>
</td>
<td align="center">
<bold>0.84</bold>
</td>
<td align="center">
<bold>0.68</bold>
</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">268</td>
<td align="center">16.89</td>
<td align="center">0.84</td>
<td align="center">0.65</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">302</td>
<td align="center">16.87</td>
<td align="center">0.84</td>
<td align="center">0.65</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Round 1 uses all
<italic>k</italic>
-mers up to length
<italic>d</italic>
, with shift = 1 and mismatch = 1. Round 2 uses only selected
<italic>k</italic>
-mers from Round 1, with shift = 0 and mismatch = 0. 'Runtime' includes both training and testing, in seconds. The values for the parameters selected on training data are in bold.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Grid search of parameters on the training data of Rounds 1 and 2 of our method</bold>
. (a) Grid search of degree and shift for Round 1 in terms of the average
<italic>Pearson Cor </italic>
with mismatch 1. The mismatch parameter can be searched in a similar manner, which is not shown here. (b) Grid search of degree and shift for Round 2 in terms of
<italic>Pearson Cor</italic>
, with mismatch 0.</p>
</caption>
<graphic xlink:href="1752-0509-8-S5-S5-2"></graphic>
</fig>
<p>Comparison of the relative performance of Rounds 1 and 2 reveals that Round 2 has significant improvements with respect to all the measures (Table
<xref ref-type="table" rid="T1">1</xref>
). In particular, the RMSE, Pearson correlation and Spearman correlation of Round 2 are better than those of Round 1 by 15%, 12% and 36%, respectively. This is due to the fact that Round 1 encodes many irrelevant
<italic>k</italic>
-mers, whereas Round 2 encodes only the important
<italic>k</italic>
-mers identified in Round 1. The kernel mapping of Round 2 is thus far more accurate than that of Round 1. With
<italic>d </italic>
= 7, Round 1 encodes 21,844
<italic>k</italic>
-mers to calculate the kernel function, whereas Round 2 encodes only 140
<italic>k</italic>
-mers. This explains the significant improvement in the runtime of Round 2 over Round 1, although Round 2 requires inputs from Round 1.</p>
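<p>The two k-mer counts quoted above can be reproduced with a short calculation. The sketch assumes a four-letter alphabet, that Round 1 counts every distinct k-mer for k from 1 to d, and that Round 2 keeps ten high-affinity and ten low-affinity k-mers for each k; the function names are ours.</p>

```python
def count_all_kmers(d, alphabet_size=4):
    """Number of distinct k-mers for k = 1..d over the given alphabet."""
    return sum(alphabet_size ** k for k in range(1, d + 1))


def count_selected_kmers(d, per_direction=10):
    """Ten high-affinity plus ten low-affinity k-mers for each k = 1..d."""
    return 2 * per_direction * d


round1_kmers = count_all_kmers(7)       # 4 + 16 + ... + 16384
round2_kmers = count_selected_kmers(7)  # 20 k-mers for each of 7 lengths
reduction_factor = round1_kmers / round2_kmers
```

<p>Under these assumptions the two values are 21,844 and 140, a reduction of roughly 156-fold in the number of encoded k-mers.</p>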
<p>The results from the 10-fold CV have demonstrated the effectiveness of the
<italic>k</italic>
-mer selection of Round 1. Figure
<xref ref-type="fig" rid="F3">3</xref>
shows two illustrative examples of importance matrices for
<italic>k </italic>
= 2 and
<italic>k </italic>
= 3. The baseline color is yellow. The red color indicates that the
<italic>k</italic>
-mer at the corresponding starting position in the 12-mer binding sequence contributes to low binding affinity (a large
<italic>K
<sub>d </sub>
</italic>
value), whereas the blue color indicates a contribution to high binding affinity. For instance, TT at positions 4-6 tends to lead to a large
<italic>K
<sub>d </sub>
</italic>
value (Figure
<xref ref-type="fig" rid="F3">3(a)</xref>
). AA and AC, on the other hand, are preferred 2-mers at position 5. The effect of TT can be further decomposed into seven 3-mers as shown in Figure
<xref ref-type="fig" rid="F3">3(b)</xref>
, that is, ATT, CTT, GTT, TTT, TTA, TTC, and TTG. Among them, CTT and TTT are those that contribute most to a large
<italic>K
<sub>d </sub>
</italic>
value if one of them appears at position 4 of the 12-mer Gcn4p-DNA binding sequence. It should be noted that such importance matrices contain both uncertainty and diversity: uncertainty means that due to the effects of other contributing
<italic>k</italic>
-mers, a
<italic>k</italic>
-mer with a red color does not necessarily lead to a high
<italic>K
<sub>d </sub>
</italic>
value, and diversity means that multiple
<italic>k</italic>
-mers can contribute to
<italic>K
<sub>d </sub>
</italic>
values at the same position. Nevertheless, these importance matrices still provide an intuitive means for researchers to visualize and interpret results, and thus gain insights into the design of a binding sequence with a desired binding affinity.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>The importance of all the (a) 2-mers and (b) 3-mers at different positions from Round 1. </bold>
The x-axis lays out all the 2-mers and 3-mers, respectively. The y-axis shows the positions within the 12-mer DNA binding sequence. The baseline color is yellow. Red color denotes the effect of leading to large
<italic>K
<sub>d </sub>
</italic>
values, whereas blue color denotes the effect of leading to small
<italic>K
<sub>d </sub>
</italic>
values.</p>
</caption>
<graphic xlink:href="1752-0509-8-S5-S5-3"></graphic>
</fig>
</sec>
<sec>
<title>Comparison with state-of-the-art methods</title>
<p>Our method was further compared with state-of-the-art methods on the same datasets. HK→ME [
<xref ref-type="bibr" rid="B21">21</xref>
] was the best performer at the DREAM5 transcription factor/DNA motif recognition competition. Following Annala's method, HK→ME was set to use all the 4- to 6-mers as well as the 2000 7-mers and 1000 8-mers with the lowest median
<italic>K
<sub>d </sub>
</italic>
values. We also compared our method with a position weight matrix (PWM) model, which assumes that each mononucleotide contributes independently to the binding affinity. The PWM model
<italic>x </italic>
is obtained by solving the linear system
<italic>A · x </italic>
=
<italic>K </italic>
where
<italic>A </italic>
is an
<italic>n × u </italic>
matrix,
<italic>n </italic>
is the number of training sequences, and
<italic>u </italic>
=
<italic>L × z </italic>
where
<italic>L </italic>
is the length of each training sequence and
<italic>z </italic>
is the size of the dictionary. For the Gcn4p dataset,
<italic>L </italic>
is 12 and
<italic>z </italic>
is 4. Here
<italic>A</italic>
[
<italic>i, j</italic>
] is set to 1 if the
<italic>i</italic>
-th sequence contains the specific nucleotide at the specific position indicated by the index
<italic>j</italic>
, otherwise 0. The
<italic>x </italic>
is a
<italic>u</italic>
-dimensional column vector to be trained, each entry of which represents the weight for the corresponding nucleotide at the corresponding position. For the Gcn4p dataset,
<italic>K </italic>
is set to be
<italic>ln</italic>
(
<italic>K
<sub>d</sub>
</italic>
) because
<italic>ln</italic>
(
<italic>K
<sub>d</sub>
</italic>
) is proportional to the binding free energy, which is assumed to be additive.</p>
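<p>The construction of the matrix A for the PWM baseline can be sketched as follows. The column ordering (position-major, nucleotides in the order A, C, G, T) and the function names are assumptions made for this illustration. The sketch builds A and evaluates a given weight vector; the weights themselves would be obtained with any least-squares solver applied to A · x = ln(K_d).</p>

```python
import math

ALPHABET = "ACGT"  # dictionary of size z = 4


def one_hot_row(sequence):
    """One row of A: column pos * 4 + nucleotide index is 1, all others 0."""
    row = [0] * (len(sequence) * len(ALPHABET))
    for pos, base in enumerate(sequence):
        row[pos * len(ALPHABET) + ALPHABET.index(base)] = 1
    return row


def design_matrix(sequences):
    """Stack the one-hot rows into the n-by-u matrix A, with u = L * z."""
    return [one_hot_row(s) for s in sequences]


def predict_kd(sequence, weights):
    """Predicted K_d = exp(row . x), because the model is fitted to ln(K_d)."""
    log_kd = sum(a * w for a, w in zip(one_hot_row(sequence), weights))
    return math.exp(log_kd)


A = design_matrix(["AC", "GT"])
# Hypothetical weights: A at position 0 adds ln 2, C at position 1 adds ln 5.
toy_weights = [math.log(2), 0, 0, 0, 0, math.log(5), 0, 0]
toy_prediction = predict_kd("AC", toy_weights)
```

<p>For the 12-mer Gcn4p data each row has 48 columns, exactly 12 of which are set to 1, so the additive model multiplies one factor per position to obtain K_d.</p>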
<p>Table
<xref ref-type="table" rid="T2">2</xref>
shows the comparison between our method and state-of-the-art methods on the same 10-fold CV. Our method significantly outperforms the other three methods. In particular, compared with HK→ME, our method scored 71% higher in the Pearson correlation and 51% higher in the Spearman correlation; compared with PWM, it scored 50% higher in the Pearson correlation and 36% higher in the Spearman correlation. There are at least three reasons for these significant improvements. First, our method does not depend on prior knowledge of which
<italic>k</italic>
-mers are important, rather it systematically explores all
<italic>k</italic>
-mers up to length
<italic>d</italic>
. Secondly, our method selects the most important
<italic>k</italic>
-mers with different
<italic>k </italic>
values based on the expected importance of such subsequences at different positions, whereas PWM assumes 1-mers are important and HK→ME assumes certain
<italic>k</italic>
-mers are important without considering their positions and number of occurrences. Thirdly, the discriminative power of SVR ensures an accurate regression in the kernel space. We also implemented the SVR model with the WD kernel without shift or mismatch, using
<italic>d </italic>
= 7 (as shown in Table
<xref ref-type="table" rid="T2">2</xref>
) and found that it also significantly outperforms HK→ME and compares favorably to PWM. Our method, however, is clearly better than this SVR model because it allows shifts and mismatches, and conducts an additional round of
<italic>k</italic>
-mer selection.</p>
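<p>The relative improvements quoted above follow directly from the correlation values in Table 2, as the short check below illustrates. The dictionary keys and the function name are ours; the numbers are copied from Table 2.</p>

```python
# Correlation values copied from Table 2 (HK_ME stands for the linear model of [21]).
TABLE2 = {
    "PWM": {"pearson": 0.56, "spearman": 0.50},
    "HK_ME": {"pearson": 0.49, "spearman": 0.45},
    "ours": {"pearson": 0.84, "spearman": 0.68},
}


def relative_gain_percent(ours, baseline):
    """Relative improvement of `ours` over `baseline`, rounded to whole percent."""
    return round(100.0 * (ours - baseline) / baseline)


gains = {
    (method, measure): relative_gain_percent(TABLE2["ours"][measure], values[measure])
    for method, values in TABLE2.items() if method != "ours"
    for measure in ("pearson", "spearman")
}
```

<p>The percentages are relative to the baseline value, so a gain of 50% over PWM in the Pearson correlation corresponds to an absolute increase from 0.56 to 0.84.</p>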
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Comparison with state-of-the-art methods.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center">PWM</th>
<th align="center">HK→ME</th>
<th align="center">SVR w. WD</th>
<th align="center">Our Method</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">RMSE</td>
<td align="center">20.2</td>
<td align="center">25.4</td>
<td align="center">22.5</td>
<td align="center">
<bold>16.9</bold>
</td>
</tr>
<tr>
<td align="center">RMSRE</td>
<td align="center">46%</td>
<td align="center">51%</td>
<td align="center">58%</td>
<td align="center">
<bold>44%</bold>
</td>
</tr>
<tr>
<td align="center">Pearson Cor</td>
<td align="center">0.56</td>
<td align="center">0.49</td>
<td align="center">0.70</td>
<td align="center">
<bold>0.84</bold>
</td>
</tr>
<tr>
<td align="center">Spearman Cor</td>
<td align="center">0.50</td>
<td align="center">0.45</td>
<td align="center">0.50</td>
<td align="center">
<bold>0.68</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>"PWM" represents the position weight matrix model. "HK→ME" represents the linear model in [
<xref ref-type="bibr" rid="B21">21</xref>
]. "SVR w. WD" represents SVR with WD kernel without mismatch or shift. All values are the averages over the same 10-fold CV. The best values in each row are in bold.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Discovery of a stable high-affinity 10-mer motif</title>
<p>Since the two-round SVR model significantly increased the accuracy and efficiency of prediction to map DNA sequences to their
<italic>K
<sub>d </sub>
</italic>
values for the Gcn4p binding, we set out to characterize
<italic>k</italic>
-mers identified as being important through Round 1. In particular, we focused on the ten 7-mers that were selected to be important for high-affinity 12-mers. These ten 7-mers are listed in Table
<xref ref-type="table" rid="T3">3</xref>
and Figure
<xref ref-type="fig" rid="F4">4</xref>
. We noticed that these 7-mers appear relatively frequently throughout the 12-mer dataset, which contains DNA sequences with
<italic>K
<sub>d </sub>
</italic>
up to 1,000 nM. We measured statistics of the
<italic>K
<sub>d </sub>
</italic>
values of 12-mers containing these important 7-mers and found that six of them (ATGACTC, TGACTCA, GTGACTC, TGAGTCA, TATGACT, and GACTCAT) lead to a much lower dispersion of
<italic>K
<sub>d </sub>
</italic>
values than the other four important 7-mers (Figure
<xref ref-type="fig" rid="F4">4</xref>
). In other words, these six 7-mers are dominant factors that can, in most situations, stabilize the value of
<italic>K
<sub>d </sub>
</italic>
regardless of the context in which these 7-mers appear in nucleotide sequences.</p>
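<p>The per-7-mer statistics described above (frequency, minimum, maximum, mean, and standard deviation) can be gathered with a simple containment scan over the 12-mer dataset. The sketch below is only an illustration: the function name, the dictionary layout, and the three toy sequences with their values are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator was used.</p>
<preformat>
```python
import statistics


def kmer_stats(kd_by_sequence, kmer):
    """Summarize K_d over all sequences that contain the given k-mer."""
    values = [kd for seq, kd in kd_by_sequence.items() if kmer in seq]
    if not values:
        return None
    return {
        "freq": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Population standard deviation (assumed; the sample version is stdev).
        "std": statistics.pstdev(values),
    }


# Hypothetical 12-mers with made-up K_d values (nM), for illustration only.
toy_dataset = {
    "AATGACTCATGG": 10.0,
    "CCATGACTCAGG": 30.0,
    "GGGGTTTTCCCC": 900.0,
}

stats = kmer_stats(toy_dataset, "ATGACTC")
print(stats)
```
</preformat>
<p>A 7-mer with a small standard deviation relative to its mean is one whose flanking context changes the affinity little, which is the sense in which the six 7-mers are described as stabilizing.</p>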
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Statistics of the ten 7-mers that were identified to be important for high-affinity 12-mers through Round 1.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Rank</th>
<th align="center">7-mer</th>
<th align="center">Freq.</th>
<th align="center">MIN</th>
<th align="center">MAX</th>
<th align="center">Average</th>
<th align="center">Standard
<break></break>
Deviation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">1</td>
<td align="center">
<bold>ATGACTC</bold>
</td>
<td align="center">419</td>
<td align="center">8.49</td>
<td align="center">409.08</td>
<td align="center">39.31</td>
<td align="center">43.04</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">
<bold>TGACTCA</bold>
</td>
<td align="center">990</td>
<td align="center">8.49</td>
<td align="center">567.81</td>
<td align="center">56.66</td>
<td align="center">54.61</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">
<bold>GTGACTC</bold>
</td>
<td align="center">446</td>
<td align="center">9.83</td>
<td align="center">648.79</td>
<td align="center">74.46</td>
<td align="center">96.84</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">
<bold>TGAGTCA</bold>
</td>
<td align="center">453</td>
<td align="center">14.52</td>
<td align="center">303.87</td>
<td align="center">63.66</td>
<td align="center">54.64</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">
<bold>TATGACT</bold>
</td>
<td align="center">224</td>
<td align="center">8.74</td>
<td align="center">896.78</td>
<td align="center">112.54</td>
<td align="center">190.25</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">
<bold>GACTCAT</bold>
</td>
<td align="center">392</td>
<td align="center">8.49</td>
<td align="center">963.28</td>
<td align="center">167.26</td>
<td align="center">254.46</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">ATGAGTC</td>
<td align="center">504</td>
<td align="center">15.60</td>
<td align="center">975.18</td>
<td align="center">276.01</td>
<td align="center">292.93</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">TGACTAA</td>
<td align="center">327</td>
<td align="center">14.67</td>
<td align="center">821.67</td>
<td align="center">192.02</td>
<td align="center">199.69</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">TACTCAC</td>
<td align="center">847</td>
<td align="center">9.65</td>
<td align="center">975.05</td>
<td align="center">437.92</td>
<td align="center">336.43</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">GACTAAT</td>
<td align="center">808</td>
<td align="center">14.67</td>
<td align="center">984.67</td>
<td align="center">528.74</td>
<td align="center">300.75</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The seven columns list the rank of importance, nucleotide sequence, number of 12-mer sequences that contain this 7-mer, the minimum
<italic>K
<sub>d </sub>
</italic>
for all such 12-mers, the maximum
<italic>K
<sub>d </sub>
</italic>
for all such 12-mers, the mean
<italic>K
<sub>d </sub>
</italic>
value for these 12-mers, and the standard deviation of <italic>K<sub>d</sub></italic> for these 12-mers, respectively. The six 7-mers in bold are the ones with lower dispersions of
<italic>K
<sub>d </sub>
</italic>
values than the remaining four.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Box plots of the 7-mers identified to be important for high-affinity 12-mers</bold>
. (a) The distribution of
<italic>K
<sub>d </sub>
</italic>
of the important 7-mers in the same order as in Table
<xref ref-type="table" rid="T3">3</xref>
. (b) The distribution of
<italic>K
<sub>d </sub>
</italic>
of the three predicted 10-mers, including the stable 10-mers TATGACTCAT and TGTGACTCAT (the left two), and the sensitive 10-mer CATGACTAAT (the right one).</p>
</caption>
<graphic xlink:href="1752-0509-8-S5-S5-4"></graphic>
</fig>
<p>Multiple sequence alignment of the six robust 7-mers reveals a high-affinity 10-mer motif, TRTGACTCAT (where R denotes A or G). Interestingly, we found that, in the 12-mer dataset, every sequence containing this 10-mer motif had high binding affinity, with the
<italic>K
<sub>d </sub>
</italic>
value being less than 100 nM. In addition, we found that this 10-mer motif leads to even lower variability in the
<italic>K
<sub>d </sub>
</italic>
values than the six 7-mers (Figure
<xref ref-type="fig" rid="F4">4</xref>
). Indeed, the 12-mers containing either of the two concrete sequences of this motif show high binding affinity with low dispersion: TATGACTCAT has mean 23.89 nM and standard deviation nM, while TGTGACTCAT has mean 26.88 nM and standard deviation 18.25 nM. These results indicate that the high-affinity 10-mer motif we found is an even more dominant factor that alone can stabilize
<italic>K
<sub>d </sub>
</italic>
at a low level. Nutiu et al. [
<xref ref-type="bibr" rid="B9">9</xref>
] confirmed that a palindromic 9-mer motif, ATGACTCAT, has a higher binding affinity than the consensus 7-mer motif, TGACTCA, as observed in previous studies [
<xref ref-type="bibr" rid="B33">33</xref>
,
<xref ref-type="bibr" rid="B36">36</xref>
]. While this 9-mer motif can indeed result in high binding affinity, it can also lead to much lower binding affinity than our 10-mer motif. For example, the 12-mer sequence CATGACTCATAG is observed to have a
<italic>K
<sub>d </sub>
</italic>
value of 265.8 nM in the HiTS-FLIP dataset. Since the last 8 bases of the two motifs are the same (i.e., 5'-TGACTCAT-3'), these observations indicate that our motif can further stabilize the
<italic>K
<sub>d </sub>
</italic>
values at a high-affinity level by including an additional nucleotide in the left half-site. This is consistent with a previous experimental study showing that, while Gcn4p binds to DNA sites as a dimer, the left half-site plays a more important role than the right half-site in strong Gcn4p-DNA interactions [
<xref ref-type="bibr" rid="B34">34</xref>
].</p>
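<p>The relationship between the degenerate 10-mer motif and the palindromic 9-mer can be checked mechanically. The sketch below expands TRTGACTCAT into its concrete sequences using the IUPAC code R = A or G, confirms the shared 8-base suffix, and shows that the 12-mer CATGACTCATAG contains the 9-mer but neither concrete 10-mer. The helper name and the reduced IUPAC table are illustrative assumptions.</p>
<preformat>
```python
from itertools import product

# Reduced IUPAC nucleotide table; only the codes needed here plus a few extras.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "N": "ACGT"}


def expand_motif(motif):
    """List every concrete sequence matched by a degenerate motif."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in motif))]


ten_mers = expand_motif("TRTGACTCAT")
nine_mer = "ATGACTCAT"

# Both motifs end with the same 8 bases, TGACTCAT.
shared_suffix = all(seq[-8:] == nine_mer[-8:] for seq in ten_mers)

# The low-affinity example quoted in the text.
example = "CATGACTCATAG"
has_nine_mer = nine_mer in example
has_ten_mer = any(seq in example for seq in ten_mers)

print(ten_mers, shared_suffix, has_nine_mer, has_ten_mer)
```
</preformat>
<p>In this example the base immediately 5' of the 9-mer is C rather than the T required by the 10-mer motif, which is the extra left half-site position discussed above.</p>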
<p>In addition to the high-affinity 10-mer motif, we found a 10-mer sequence, CATGACTAAT, by performing multiple sequence alignment of four other 7-mers identified through Round 1 (i.e., ATGAGTC, TGACTAA, GACTCAC, and GACTAAT). Unlike the 10-mer motif that we found, however, this 10-mer is more context-dependent. While 12-mers containing this 10-mer also gave relatively high mean binding affinity (
<italic>K
<sub>d </sub>
</italic>
= 97.88 nM), the binding affinities spread over a much wider range, with
<italic>K
<sub>d </sub>
</italic>
values between 20.61 nM and 432.00 nM and a standard deviation of 90.65 nM (Figure
<xref ref-type="fig" rid="F4">4</xref>
). This shows that two nucleotide substitutions from the low-variance, high-affinity 10-mer motif can substantially alter the characteristics of the 10-mer. These observations indicate that the DNA binding affinity landscape of Gcn4p is very complex and that strong interdependency is prevalent. This suggests that models based on additive, independent contributions to binding free energy may not be able to quantitatively capture interactions between DNA and dimeric (and, more generally, oligomeric) TFs. Efficient models that consider the interdependency of subsequences are therefore key to understanding the DNA binding affinity landscape of such proteins and to fine-tuning gene expression processes.</p>
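<p>The number of substitutions separating the sensitive 10-mer from the two stable 10-mers can be verified with a Hamming distance. This is a small illustrative sketch; the function name is hypothetical.</p>
<preformat>
```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


sensitive = "CATGACTAAT"
stable = ["TATGACTCAT", "TGTGACTCAT"]

distances = {seq: hamming(sensitive, seq) for seq in stable}
print(distances)  # {'TATGACTCAT': 2, 'TGTGACTCAT': 3}
```
</preformat>
<p>The sensitive 10-mer differs from TATGACTCAT at positions 1 and 8 (C instead of T, and A instead of C), so the two substitutions mentioned above are measured against that member of the motif.</p>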
</sec>
<sec>
<title>Results on four other TFs in
<italic>S. cerevisiae</italic>
</title>
<p>To evaluate the generality of our method, we tested it on MITOMI2.0 datasets including four other TFs in
<italic>S. cerevisiae</italic>
, namely Cbf1p, Cin5p, Pho4p and Yap1p [
<xref ref-type="bibr" rid="B8">8</xref>
]. The same 5-fold CV was applied to evaluate each method. As shown in Tables 2 and 4, the Pearson correlation coefficients of all methods on these four TFs are markedly lower than those on Gcn4p. This is expected because the input sequences for these four TFs are much longer than those for Gcn4p (52 bp vs. 12 bp), which makes the regression considerably more difficult. Nevertheless, our method consistently outperforms the other methods, as it does on the Gcn4p dataset. This demonstrates the generality of our method and also suggests that it can be applied to longer DNA sequences with high accuracy. The latter is essential for predicting the binding affinity landscape of oligomeric TFs.</p>
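<p>Evaluating every method on the same cross-validation folds requires fixing the partition once and reusing it. The following sketch shows one way to do this in Python; it is not the authors' evaluation code. The mean-value predictor, the mean absolute error score, and the toy data are hypothetical stand-ins for a real model, the Pearson criterion, and the MITOMI2.0 measurements.</p>
<preformat>
```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices once with a fixed seed and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, folds, fit, predict, score):
    """Average a score over folds; each fold is held out once for testing."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in test_idx]
        scores.append(score([ys[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Hypothetical baseline: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


def mean_abs_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sequences and affinities, for illustration only.
seqs = ["SEQ%02d" % i for i in range(10)]
kds = [float(10 * (i + 1)) for i in range(10)]

folds = k_fold_indices(len(seqs), k=5)  # computed once, shared by all methods
baseline = cross_validate(seqs, kds, folds, fit_mean, predict_mean, mean_abs_error)
print(baseline)
```
</preformat>
<p>Passing the same <monospace>folds</monospace> list to each method keeps the comparison paired, so differences in the averaged score reflect the methods rather than the split.</p>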
<table-wrap id="T4" position="float">
<label>Table 4</label>
<caption>
<p>Comparison with state-of-the-art methods on four other TFs in <italic>S. cerevisiae</italic></p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center">PWM</th>
<th align="center">HK
<italic></italic>
ME</th>
<th align="center">SVR
<break></break>
w. WD</th>
<th align="center">Our
<break></break>
Method</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Cbf1p</td>
<td align="center">0.25</td>
<td align="center">0.30</td>
<td align="center">0.36</td>
<td align="center">
<bold>0.63</bold>
</td>
</tr>
<tr>
<td align="center">Cin5p</td>
<td align="center">0.21</td>
<td align="center">0.26</td>
<td align="center">0.47</td>
<td align="center">
<bold>0.62</bold>
</td>
</tr>
<tr>
<td align="center">Pho4p</td>
<td align="center">0.19</td>
<td align="center">0.24</td>
<td align="center">0.41</td>
<td align="center">
<bold>0.61</bold>
</td>
</tr>
<tr>
<td align="center">Yap1p</td>
<td align="center">0.22</td>
<td align="center">0.24</td>
<td align="center">0.40</td>
<td align="center">
<bold>0.58</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>All values are Pearson correlation coefficients averaged over the same 5-fold CV. The best value in each row is in bold.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusion</title>
<p>In this paper, we have proposed a novel two-round support vector regression method that is based on weighted degree kernels with shifts and mismatches, with the first round focusing on feature selection and the second round focusing on regression. WD kernels have been used with the support vector classification method and successfully applied to a number of biological sequence classification problems, including transcription start site prediction [
<xref ref-type="bibr" rid="B37">37</xref>
], splice site prediction [
<xref ref-type="bibr" rid="B38">38</xref>
], alternative splicing site prediction [
<xref ref-type="bibr" rid="B28">28</xref>
], trans-splicing site prediction [
<xref ref-type="bibr" rid="B39">39</xref>
], and translation initiation site prediction [
<xref ref-type="bibr" rid="B40">40</xref>
]. However, the power of combining WD kernels with support vector regression has not been well studied in bioinformatics. Further, to the best of our knowledge, two rounds of string kernels have not previously been applied to identify crucial k-mers and to avoid projecting the input sequences into an overly high-dimensional kernel space.</p>
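<p>For readers unfamiliar with the weighted degree kernel, the following Python sketch shows its basic form without shifts or mismatches: two equal-length sequences are compared position by position, and every matching substring of length d up to a maximum degree D contributes a weight. The weights 2(D - d + 1) / (D(D + 1)) are the commonly used default for this kernel; whether the method described here uses exactly these weights is an assumption, and the function name is hypothetical.</p>
<preformat>
```python
def wd_kernel(x, y, max_degree):
    """Weighted degree kernel between two equal-length sequences.

    Basic variant only: no shifts and no mismatches are allowed.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    length = len(x)
    total = 0.0
    for d in range(1, max_degree + 1):
        # Commonly used weighting; the weights over d = 1..D sum to 1.
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        # Count positions where the length-d substrings agree exactly.
        matches = sum(1 for i in range(length - d + 1) if x[i:i + d] == y[i:i + d])
        total += beta * matches
    return total


k_same = wd_kernel("ACGT", "ACGT", max_degree=1)
k_near = wd_kernel("ACGT", "ACGA", max_degree=2)
print(k_same, k_near)
```
</preformat>
<p>Because each (position, substring) pair acts as a separate feature, the implicit feature space grows quickly with sequence length and degree, which is the dimensionality concern that the first-round selection of a small set of k-mers is meant to address.</p>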
<p>We applied the proposed two-round method to model the mapping of DNA sequences to their binding affinity for Gcn4p in yeast using high-resolution datasets measured by HiTS-FLIP. We showed that the quantitative predictions of our new method are significantly improved over those of existing methods. We further demonstrated that identifying important subsequences allows the extraction of human-interpretable rules for the quantitative control of binding affinity. Our method predicted two 10-mers that are surprisingly stable and were not previously reported. It also predicted another 10-mer that differs from one of the stable ones by only two nucleotides yet is comparatively sensitive. Additional tests on four other TFs validate the generalization power of the proposed method. Our program and sample data are freely available at
<ext-link ext-link-type="uri" xlink:href="http://sfb.kaust.edu.sa/Pages/Software.aspx">http://sfb.kaust.edu.sa/Pages/Software.aspx</ext-link>
.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>We thank Polly Fordyce for valuable discussions about the MITOMI2.0 datasets. This work and its publication costs were supported by grant number FCC/1/1976-04-01 from King Abdullah University of Science and Technology (KAUST).</p>
<p>This article has been published as part of
<italic>BMC Systems Biology </italic>
Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5">http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Endy</surname>
<given-names>D</given-names>
</name>
<article-title>Foundations for engineering biology</article-title>
<source>Nature</source>
<year>2005</year>
<volume>438</volume>
<issue>7067</issue>
<fpage>449</fpage>
<lpage>453</lpage>
<pub-id pub-id-type="doi">10.1038/nature04342</pub-id>
<pub-id pub-id-type="pmid">16306983</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Purnick</surname>
<given-names>PEM</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>R</given-names>
</name>
<article-title>The second wave of synthetic biology: from modules to systems</article-title>
<source>Nat Rev Mol Cell Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>6</issue>
<fpage>410</fpage>
<lpage>422</lpage>
<pub-id pub-id-type="doi">10.1038/nrm2698</pub-id>
<pub-id pub-id-type="pmid">19461664</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Kuwahara</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>X</given-names>
</name>
<article-title>A framework for scalable parameter estimation of gene circuit models using structural information</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>13</issue>
<fpage>98</fpage>
<lpage>107</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt232</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="book">
<name>
<surname>Alberts</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Raff</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Walter</surname>
<given-names>P</given-names>
</name>
<article-title>Molecular Biology of the Cell</article-title>
<source>Garland Science</source>
<year>2002</year>
<edition>4th</edition>
<comment>New York</comment>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Berger</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<article-title>Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors</article-title>
<source>Nat Protoc</source>
<year>2009</year>
<volume>4</volume>
<issue>3</issue>
<fpage>393</fpage>
<lpage>411</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2008.195</pub-id>
<pub-id pub-id-type="pmid">19265799</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Gottardo</surname>
<given-names>R</given-names>
</name>
<article-title>Modeling and analysis of ChIP-chip experiments</article-title>
<source>Methods Mol Biol</source>
<year>2009</year>
<volume>567</volume>
<fpage>133</fpage>
<lpage>143</lpage>
<pub-id pub-id-type="doi">10.1007/978-1-60327-414-2_9</pub-id>
<pub-id pub-id-type="pmid">19588090</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Maerkl</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Quake</surname>
<given-names>SR</given-names>
</name>
<article-title>A systems approach to measuring the binding energy landscapes of transcription factors</article-title>
<source>Science</source>
<year>2007</year>
<volume>315</volume>
<issue>5809</issue>
<fpage>233</fpage>
<lpage>237</lpage>
<pub-id pub-id-type="doi">10.1126/science.1131007</pub-id>
<pub-id pub-id-type="pmid">17218526</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Fordyce</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Gerber</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Tran</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>DeRisi</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Quake</surname>
<given-names>SR</given-names>
</name>
<article-title>De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis</article-title>
<source>Nat Biotechnol</source>
<year>2010</year>
<volume>28</volume>
<issue>9</issue>
<fpage>970</fpage>
<lpage>975</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1675</pub-id>
<pub-id pub-id-type="pmid">20802496</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Nutiu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Khrebtukova</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Silva</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Schroth</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Burge</surname>
<given-names>CB</given-names>
</name>
<article-title>Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument</article-title>
<source>Nat Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<issue>7</issue>
<fpage>659</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.1882</pub-id>
<pub-id pub-id-type="pmid">21706015</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Alleyne</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Peña-Castillo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Badis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Talukder</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Berger</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Gehrke</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Philippakis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Morris</surname>
<given-names>QD</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>TR</given-names>
</name>
<article-title>Predicting the binding preference of transcription factors to individual DNA
<italic>k</italic>
-mers</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>8</issue>
<fpage>1012</fpage>
<lpage>1018</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn645</pub-id>
<pub-id pub-id-type="pmid">19088121</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Weirauch</surname>
<given-names>MT</given-names>
</name>
<name>
<surname>Cote</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Norel</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Annala</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Riley</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Saez-Rodriguez</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Cokelaer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Vedenko</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Talukder</surname>
<given-names>S</given-names>
</name>
<name>
<surname>DREAMC</surname>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Morris</surname>
<given-names>QD</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Stolovitzky</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>TR</given-names>
</name>
<article-title>Evaluation of methods for modeling transcription factor sequence specificity</article-title>
<source>Nat Biotechnol</source>
<year>2013</year>
<volume>31</volume>
<issue>2</issue>
<fpage>126</fpage>
<lpage>134</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2486</pub-id>
<pub-id pub-id-type="pmid">23354101</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Berg</surname>
<given-names>OG</given-names>
</name>
<name>
<surname>von Hippel</surname>
<given-names>PH</given-names>
</name>
<article-title>Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters</article-title>
<source>J Mol Biol</source>
<year>1987</year>
<volume>193</volume>
<issue>4</issue>
<fpage>723</fpage>
<lpage>750</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(87)90354-8</pub-id>
<pub-id pub-id-type="pmid">3612791</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<article-title>DNA binding sites: representation and discovery</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<issue>1</issue>
<fpage>16</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.1.16</pub-id>
<pub-id pub-id-type="pmid">10812473</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Benos</surname>
<given-names>PV</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<article-title>Additivity in protein-DNA interactions: how good an approximation is it?</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>20</issue>
<fpage>4442</fpage>
<lpage>4451</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkf578</pub-id>
<pub-id pub-id-type="pmid">12384591</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>PLF</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>GM</given-names>
</name>
<article-title>Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>5</issue>
<fpage>1255</fpage>
<lpage>1261</lpage>
<pub-id pub-id-type="doi">10.1093/nar/30.5.1255</pub-id>
<pub-id pub-id-type="pmid">11861919</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>XS</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>JS</given-names>
</name>
<article-title>An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments</article-title>
<source>Nat Biotechnol</source>
<year>2002</year>
<volume>20</volume>
<issue>8</issue>
<fpage>835</fpage>
<lpage>839</lpage>
<pub-id pub-id-type="doi">10.1038/nbt717</pub-id>
<pub-id pub-id-type="pmid">12101404</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Foat</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Morozov</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<article-title>Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>14</issue>
<fpage>141</fpage>
<lpage>149</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl223</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Morris</surname>
<given-names>Q</given-names>
</name>
<article-title>RankMotif++: a motif-search algorithm that accounts for relative ranks of
<italic>K </italic>
-mers in binding transcription factors</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>13</issue>
<fpage>72</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm224</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Agius</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Arvey</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Leslie</surname>
<given-names>C</given-names>
</name>
<article-title>High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions</article-title>
<source>PLoS Comput Biol</source>
<year>2010</year>
<volume>6</volume>
<issue>9</issue>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Karchin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
<article-title>Discriminative prediction of mammalian enhancers from DNA sequence</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<issue>12</issue>
<fpage>2167</fpage>
<lpage>2180</lpage>
<pub-id pub-id-type="doi">10.1101/gr.121905.111</pub-id>
<pub-id pub-id-type="pmid">21875935</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Annala</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Laurila</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lähdesmäki</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Nykter</surname>
<given-names>M</given-names>
</name>
<article-title>A linear model for transcription factor binding affinity prediction in protein binding microarrays</article-title>
<source>PLoS One</source>
<year>2011</year>
<volume>6</volume>
<issue>5</issue>
<fpage>20059</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0020059</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Chervonenkis</surname>
<given-names>A</given-names>
</name>
<article-title>Theory of Pattern Recognition</article-title>
<source>Nauka</source>
<year>1974</year>
<comment>Moscow</comment>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Jebara</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Kondor</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Howard</surname>
<given-names>A</given-names>
</name>
<article-title>Probability product kernels</article-title>
<source>The Journal of Machine Learning Research</source>
<year>2004</year>
<volume>5</volume>
<fpage>819</fpage>
<lpage>844</lpage>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Xie</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Jankovic</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Bajic</surname>
<given-names>VB</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>X</given-names>
</name>
<article-title>Poly(A) motif prediction using spectral latent features from human DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2013</year>
<volume>29</volume>
<issue>13</issue>
<fpage>316</fpage>
<lpage>325</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt218</pub-id>
<pub-id pub-id-type="pmid">23267171</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="other">
<name>
<surname>Leslie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<article-title>The spectrum kernel: a string kernel for SVM protein classification</article-title>
<source>Proceedings of Pacific Symposium on Biocomputing (PSB2002)</source>
<year>2002</year>
<fpage>546</fpage>
<lpage>575</lpage>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="other">
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<article-title>Accurate splice site detection for
<italic>C. elegans</italic>
</article-title>
<source>Kernel Methods in Computational Biology</source>
<year>2004</year>
<fpage>277</fpage>
<lpage>298</lpage>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Leslie</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<article-title>Mismatch string kernels for discriminative protein classification</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>4</issue>
<fpage>467</fpage>
<lpage>476</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg431</pub-id>
<pub-id pub-id-type="pmid">14990442</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schölkopf</surname>
<given-names>B</given-names>
</name>
<article-title>RASE: recognition of alternatively spliced exons in
<italic>C.elegans</italic>
</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>Suppl 1</issue>
<fpage>369</fpage>
<lpage>377</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti1053</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="book">
<name>
<surname>Mohapatra</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mishra</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Padhy</surname>
<given-names>S</given-names>
</name>
<article-title>Discriminative DNA classification and motif prediction using weighted degree string kernels with shift and mismatch</article-title>
<source>Proceedings of ICAC3'09</source>
<year>2009</year>
<publisher-name>ACM, New York, NY, USA</publisher-name>
<fpage>56</fpage>
<lpage>61</lpage>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zien</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Philips</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<article-title>POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>13</issue>
<fpage>6</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn170</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Natarajan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Jackson</surname>
<given-names>BM</given-names>
</name>
<name>
<surname>Slade</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hinnebusch</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Marton</surname>
<given-names>MJ</given-names>
</name>
<article-title>Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast</article-title>
<source>Mol Cell Biol</source>
<year>2001</year>
<volume>21</volume>
<issue>13</issue>
<fpage>4347</fpage>
<lpage>4368</lpage>
<pub-id pub-id-type="doi">10.1128/MCB.21.13.4347-4368.2001</pub-id>
<pub-id pub-id-type="pmid">11390663</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Hope</surname>
<given-names>IA</given-names>
</name>
<name>
<surname>Struhl</surname>
<given-names>K</given-names>
</name>
<article-title>GCN4, a eukaryotic transcriptional activator protein, binds as a dimer to target DNA</article-title>
<source>EMBO J</source>
<year>1987</year>
<volume>6</volume>
<issue>9</issue>
<fpage>2781</fpage>
<lpage>2784</lpage>
<pub-id pub-id-type="pmid">3678204</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Hill</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Hope</surname>
<given-names>IA</given-names>
</name>
<name>
<surname>Macke</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Struhl</surname>
<given-names>K</given-names>
</name>
<article-title>Saturation mutagenesis of the yeast his3 regulatory site: requirements for transcriptional induction and for binding by GCN4 activator protein</article-title>
<source>Science</source>
<year>1986</year>
<volume>234</volume>
<issue>4775</issue>
<fpage>451</fpage>
<lpage>457</lpage>
<pub-id pub-id-type="doi">10.1126/science.3532321</pub-id>
<pub-id pub-id-type="pmid">3532321</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Sellers</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Vincent</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Struhl</surname>
<given-names>K</given-names>
</name>
<article-title>Mutations that define the optimal half-site for binding yeast GCN4 activator protein and identify an ATF/CREB-like repressor that recognizes similar DNA sites</article-title>
<source>Mol Cell Biol</source>
<year>1990</year>
<volume>10</volume>
<issue>10</issue>
<fpage>5077</fpage>
<lpage>5086</lpage>
<pub-id pub-id-type="pmid">2204805</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Hinnebusch</surname>
<given-names>AG</given-names>
</name>
<article-title>Translational regulation of GCN4 and the general amino acid control of yeast</article-title>
<source>Annu Rev Microbiol</source>
<year>2005</year>
<volume>59</volume>
<fpage>407</fpage>
<lpage>450</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.micro.59.031805.133833</pub-id>
<pub-id pub-id-type="pmid">16153175</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Zhu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Byers</surname>
<given-names>KJRP</given-names>
</name>
<name>
<surname>McCord</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Berger</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Newburger</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Saulrieta</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>MV</given-names>
</name>
<name>
<surname>Radhakrishnan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Philippakis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>De Masi</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Pacek</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rolfs</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Murthy</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Labaer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
<article-title>High-resolution DNA-binding specificity analysis of yeast transcription factors</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<issue>4</issue>
<fpage>556</fpage>
<lpage>566</lpage>
<pub-id pub-id-type="doi">10.1101/gr.090233.108</pub-id>
<pub-id pub-id-type="pmid">19158363</pub-id>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="journal">
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zien</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<article-title>ARTS: accurate recognition of transcription starts in human</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>14</issue>
<fpage>472</fpage>
<lpage>480</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl250</pub-id>
<pub-id pub-id-type="pmid">16352654</pub-id>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal">
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schweikert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Philips</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Behr</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<article-title>Accurate splice site prediction using support vector machines</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<issue>Suppl 10</issue>
<fpage>7</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-S10-S7</pub-id>
<pub-id pub-id-type="pmid">17212828</pub-id>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal">
<name>
<surname>Schweikert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zien</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Zeller</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Behr</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Dieterich</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ong</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Philips</surname>
<given-names>P</given-names>
</name>
<name>
<surname>De Bona</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hartmann</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Bohlen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Krüger</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sonnenburg</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rätsch</surname>
<given-names>G</given-names>
</name>
<article-title>mGene: accurate SVM-based gene finding with an application to nematode genomes</article-title>
<source>Genome Res</source>
<year>2009</year>
<volume>19</volume>
<issue>11</issue>
<fpage>2133</fpage>
<lpage>2143</lpage>
<pub-id pub-id-type="doi">10.1101/gr.090597.108</pub-id>
<pub-id pub-id-type="pmid">19564452</pub-id>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="journal">
<name>
<surname>Saeys</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Abeel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Degroeve</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Van de Peer</surname>
<given-names>Y</given-names>
</name>
<article-title>Translation initiation site prediction on a genomic scale: beauty in simplicity</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>13</issue>
<fpage>418</fpage>
<lpage>423</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm177</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000964 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000964 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4305984
   |texte=   Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25605483" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021