MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Machine learning for regulatory analysis and transcription factor target prediction in yeast

Internal identifier: 000566 (Pmc/Corpus); previous: 000565; next: 000567

Authors: Dustin T. Holloway; Mark Kon; Charles DeLisi

Source:

RBID : PMC:2533145

Abstract

High throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps—the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104 Saccharomyces cerevisiae regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying k-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active. 
We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.

Electronic Supplementary Material

Supplementary material is available in the online version of this article at http://dx.doi.org/10.1007/s11693-006-9003-3 and is accessible for authorized users.


Url:
DOI: 10.1007/s11693-006-9003-3
PubMed: 19003435
PubMed Central: 2533145

Links to Exploration step

PMC:2533145

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Machine learning for regulatory analysis and transcription factor target prediction in yeast</title>
<author>
<name sortKey="Holloway, Dustin T" sort="Holloway, Dustin T" uniqKey="Holloway D" first="Dustin T." last="Holloway">Dustin T. Holloway</name>
<affiliation>
<nlm:aff id="Aff1">Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kon, Mark" sort="Kon, Mark" uniqKey="Kon M" first="Mark" last="Kon">Mark Kon</name>
<affiliation>
<nlm:aff id="Aff2">Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Delisi, Charles" sort="Delisi, Charles" uniqKey="Delisi C" first="Charles" last="Delisi">Charles DeLisi</name>
<affiliation>
<nlm:aff id="Aff3">Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">19003435</idno>
<idno type="pmc">2533145</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2533145</idno>
<idno type="RBID">PMC:2533145</idno>
<idno type="doi">10.1007/s11693-006-9003-3</idno>
<date when="2006">2006</date>
<idno type="wicri:Area/Pmc/Corpus">000566</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000566</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Machine learning for regulatory analysis and transcription factor target prediction in yeast</title>
<author>
<name sortKey="Holloway, Dustin T" sort="Holloway, Dustin T" uniqKey="Holloway D" first="Dustin T." last="Holloway">Dustin T. Holloway</name>
<affiliation>
<nlm:aff id="Aff1">Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kon, Mark" sort="Kon, Mark" uniqKey="Kon M" first="Mark" last="Kon">Mark Kon</name>
<affiliation>
<nlm:aff id="Aff2">Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Delisi, Charles" sort="Delisi, Charles" uniqKey="Delisi C" first="Charles" last="Delisi">Charles DeLisi</name>
<affiliation>
<nlm:aff id="Aff3">Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Systems and Synthetic Biology</title>
<idno type="ISSN">1872-5325</idno>
<idno type="eISSN">1872-5333</idno>
<imprint>
<date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>High throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps—the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104
<italic>Saccharomyces cerevisiae</italic>
regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying
<italic>k</italic>
-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active. We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.</p>
<sec>
<title>Electronic Supplementary Material</title>
<p>Supplementary material is available in the online version of this article at
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/s11693-006-9003-3">http://dx.doi.org/10.1007/s11693-006-9003-3</ext-link>
and is accessible for authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Syst Synth Biol</journal-id>
<journal-title>Systems and Synthetic Biology</journal-title>
<issn pub-type="ppub">1872-5325</issn>
<issn pub-type="epub">1872-5333</issn>
<publisher>
<publisher-name>Kluwer Academic Publishers</publisher-name>
<publisher-loc>Dordrecht</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">19003435</article-id>
<article-id pub-id-type="pmc">2533145</article-id>
<article-id pub-id-type="publisher-id">9003</article-id>
<article-id pub-id-type="doi">10.1007/s11693-006-9003-3</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Machine learning for regulatory analysis and transcription factor target prediction in yeast</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name name-style="western">
<surname>Holloway</surname>
<given-names>Dustin T.</given-names>
</name>
<address>
<email>dth128@bu.edu</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name name-style="western">
<surname>Kon</surname>
<given-names>Mark</given-names>
</name>
<address>
<email>mkon@bu.edu</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name name-style="western">
<surname>DeLisi</surname>
<given-names>Charles</given-names>
</name>
<address>
<email>delisi@bu.edu</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA 02215 USA</aff>
<aff id="Aff2">
<label>2</label>
Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA</aff>
<aff id="Aff3">
<label>3</label>
Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>31</day>
<month>10</month>
<year>2006</year>
</pub-date>
<pub-date pub-type="ppub">
<month>3</month>
<year>2007</year>
</pub-date>
<volume>1</volume>
<issue>1</issue>
<fpage>25</fpage>
<lpage>46</lpage>
<permissions>
<copyright-statement>© Springer Science + Business Media B.V. 2006</copyright-statement>
</permissions>
<abstract>
<p>High throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps—the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104
<italic>Saccharomyces cerevisiae</italic>
regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying
<italic>k</italic>
-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active. We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.</p>
<sec>
<title>Electronic Supplementary Material</title>
<p>Supplementary material is available in the online version of this article at
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/s11693-006-9003-3">http://dx.doi.org/10.1007/s11693-006-9003-3</ext-link>
and is accessible for authorized users.</p>
</sec>
</abstract>
<kwd-group>
<title>Keywords</title>
<kwd>Transcription factor</kwd>
<kwd>SVM</kwd>
<kwd>Machine learning</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© Springer Science + Business Media B.V. 2007</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>Understanding transcriptional regulation is one of the key challenges of the post-genomic era. Transcription factors control the expression of their target genes by binding specific sequences of bases, typically 10–15 nt in length, in a region upstream of transcription initiation. Sequences bound by a TF are not identical to each other and only represent a preferred pattern of nucleotides within a binding motif. The complete regulation of a gene will often depend on the co-operative or antagonistic effects of several transcription factors with potentially overlapping binding sites. Thus, the regulatory code for a gene is composed of a pattern of degenerate motifs concealed within the promoter.</p>
<p>Many methods for predicting additional target sites for a TF have been proposed. Foundational work in TF binding site representation involved the use of position-specific scoring matrices (PSSMs) (Stormo
<xref ref-type="bibr" rid="CR77">2000</xref>
; Workman and Stormo
<xref ref-type="bibr" rid="CR87">2000</xref>
; Schneider et al.
<xref ref-type="bibr" rid="CR72">1986</xref>
; Schneider and Stephens
<xref ref-type="bibr" rid="CR71">1990</xref>
), which contain the frequency of nucleotide bases at each position in a possible binding site, or motif. New predictions are sites which match the PSSM based on a score threshold (Stormo
<xref ref-type="bibr" rid="CR77">2000</xref>
). Supervised learning tools such as support vector machines (SVM) can be used to categorize new genes when given a set of genes known to be regulated by a certain factor and a set known not to be co-regulated. Unsupervised methods begin with less well-defined information, for example a set of genes from a microarray study which show similar expression over many experiments. Such genes could be hypothesized to be regulated by common factors and thus contain some set of common but unknown sequence patterns in their promoters. These patterns can then be discovered by statistical overrepresentation or by local search algorithms such as Gibbs sampling. Several unsupervised techniques for predicting binding sites have been reported (Conlon et al.
<xref ref-type="bibr" rid="CR19">2003</xref>
; Keles et al.
<xref ref-type="bibr" rid="CR47">2004</xref>
; Wang et al.
<xref ref-type="bibr" rid="CR83">2002</xref>
; Bussemaker et al.
<xref ref-type="bibr" rid="CR14">2001</xref>
; Birnbaum et al.
<xref ref-type="bibr" rid="CR10">2001</xref>
; Zhu et al.
<xref ref-type="bibr" rid="CR92">2002</xref>
; Pritsker et al.
<xref ref-type="bibr" rid="CR67">2004</xref>
; Elemento and Tavazoie
<xref ref-type="bibr" rid="CR23">2005</xref>
), and a comprehensive review of current motif-discovery methods is available (Tompa et al.
<xref ref-type="bibr" rid="CR81">2005</xref>
).</p>
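<p>As an illustration of the PSSM approach described above, the following minimal Python sketch builds a log-odds matrix from aligned sites and reports promoter windows whose score meets a threshold. The example sites, the pseudocount, the uniform background frequency of 0.25, and all function names are assumptions made for this sketch; they are not taken from the study.</p>
<preformat>
```python
import math

# Hypothetical aligned binding sites, for illustration only.
SITES = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per motif position mapping base to log-odds score."""
    pssm = []
    total = len(sites) + pseudocount * len(BASES)
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            # Smoothed base frequency at this position.
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(col[base] for col, base in zip(pssm, window))


def scan(pssm, promoter, threshold):
    """Return (offset, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(promoter) - width + 1):
        s = score_window(pssm, promoter[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits


pssm = build_pssm(SITES)
hits = scan(pssm, "AAATGACTCGGG", threshold=5.0)
```
</preformat>
<p>The threshold controls the trade-off between sensitivity and specificity: lowering it admits more degenerate sites at the cost of more false positives.</p>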
<p>The approach reported here is a supervised pattern classification scheme designed to integrate a large number of heterogeneous data sources in order to more accurately predict the association between a transcription factor and its targets. In particular, we explore the use of support vector machines, which are able to incorporate high-dimensional data sets (many features). SVM classifiers have previously been used for the prediction of protein homology (Jaakkola et al.
<xref ref-type="bibr" rid="CR46">1999</xref>
), secondary structure (Hua and Sun
<xref ref-type="bibr" rid="CR41">2001a</xref>
), and sub-cellular localization (Hua and Sun
<xref ref-type="bibr" rid="CR42">2001b</xref>
). As sequence classifiers they have also been useful in predicting translation start sites (Zien et al.
<xref ref-type="bibr" rid="CR93">2000</xref>
), mRNA splice sites, and signal peptide cleavage sites (Wang et al.
<xref ref-type="bibr" rid="CR84">2005</xref>
). More broadly they show good performance in the identification of normal and cancerous tissue samples (Furey et al.
<xref ref-type="bibr" rid="CR26">2000</xref>
) as well as prediction of gene function (Pavlidis and Noble
<xref ref-type="bibr" rid="CR63">2001</xref>
).</p>
<p>Few groups have published work on supervised classification schemes for predicting new transcription factor targets. We briefly reviewed some of these previously (Holloway et al.
<xref ref-type="bibr" rid="CR40">2006</xref>
). One method includes linear discriminant analysis (LDA) to select from a set of potentially co-regulated genes those that are most likely to share common transcription factors (Simonis et al.
<xref ref-type="bibr" rid="CR74">2004</xref>
). Another approach uses Bayesian networks to learn the combinatorial relationships of TFs and targets that underlie specific gene expression experiments (Beer and Tavazoie
<xref ref-type="bibr" rid="CR6">2004</xref>
). Finally, in an approach similar to ours, SVMs have been applied to microarray data in order to predict TF–target associations (Qian et al.
<xref ref-type="bibr" rid="CR68">2003</xref>
).</p>
<p>Although some of these techniques work well, they either do not effectively incorporate the large amount of regulatory data available in ChIP–chip interactions or they base their classification on only one or two types of genomic data. Our approach easily combines 26 large genomic datasets, adaptively weighting each data source based on its ability to correctly classify a training set. The combination of heterogeneous data reduces false positive predictions while maintaining high accuracy. Genomic data combination using SVMs has been demonstrated before. Protein sequence similarity, protein–protein interactions, protein hydrophobicity, and gene expression data were successfully combined to predict the functional group of a set of proteins, and the combination of data was shown to significantly outperform individual methods (Lanckriet et al.
<xref ref-type="bibr" rid="CR51">2004</xref>
).</p>
<p>We provide accuracy measurements on our classifiers based on leave-one-out cross validation, and we benchmark our results against randomized datasets. Our full set of predictions for 104 TFs based on all combined methods can be downloaded from our website,
<ext-link ext-link-type="uri" xlink:href="http://www.cagt10.bu.edu/SSBPaper/MachineLearningTFSSB.htm">http://www.cagt10.bu.edu/SSBPaper/MachineLearningTFSSB.htm</ext-link>
.</p>
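<p>The leave-one-out procedure mentioned above can be sketched as follows. Each example is held out in turn, a model is trained on the rest, and the held-out example is then classified. The nearest-centroid learner used here is only a stand-in so that the sketch runs without external libraries; it is not the SVM used in the study, and all names and data are hypothetical.</p>
<preformat>
```python
def train_centroids(xs, ys):
    """Stand-in learner (not an SVM): mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is nearest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))


def leave_one_out_accuracy(examples, labels, train, predict):
    """Hold out each example once and report the fraction classified correctly."""
    correct = 0
    for i in range(len(examples)):
        train_x = examples[:i] + examples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, examples[i]) == labels[i]:
            correct += 1
    return correct / len(examples)


# Toy one-dimensional data: three negatives near 0, three positives near 1.
examples = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]]
labels = [-1, -1, -1, 1, 1, 1]
accuracy = leave_one_out_accuracy(examples, labels, train_centroids, predict_centroid)
```
</preformat>
<p>Repeating the same loop after shuffling the labels gives a randomized baseline against which the real accuracy can be compared.</p>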
<sec id="Sec2">
<title>SVMs: background</title>
<p>We consider 26 different datasets sequentially, train a classifier on each, and then construct a composite classifier which is a weighted combination of the 26. For each training set, we develop an allocation rule for every TF. Let
<italic>N</italic>
be the size of the training set for a particular TF (the collection of positive and negative examples, i.e., genes which do and do not bind it). Each gene has a set of attributes forming a vector that contributes to the distinction between positive and negative sets. As an example, an attribute vector for a gene could be an ordered list consisting of the number of times each possible 4-mer occurs in the upstream region. The collection of such vectors is the
<italic>feature space, F</italic>
. Each gene would then be characterized by a 256 component
<italic>feature vector</italic>
. The SVM generates a separating hyperplane of dimension
<italic>D</italic>
 − 1 = 255 in this 256-dimensional feature space (so
<italic>D</italic>
 = 256), dividing positives from negatives (
<italic>d</italic>
will henceforth be an index over the features of the dataset). We write a vector in
<italic>F</italic>
as
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
 = (
<italic>x</italic>
<sub>
<italic>i</italic>
1</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
2</sub>
,
<italic>x</italic>
<sub>
<italic>i</italic>
3</sub>
...
<italic>x</italic>
<sub>
<italic>id</italic>
</sub>
), the components
<italic>x</italic>
<sub>
<italic>id</italic>
</sub>
representing, for the example above, the count of the
<italic>d</italic>
th
<italic>k</italic>
-mer in the
<italic>i</italic>
th gene. Then the equation for a hyperplane has the form
<disp-formula id="Equ1">
<label>1</label>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ f({\bf x})={\bf w}\cdot{\bf x}+b=0 $$\end{document}</tex-math>
</disp-formula>
where
<bold>x</bold>
 = (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
, …,
<italic>x</italic>
<sub>
<italic>d</italic>
</sub>
) and
<bold>w</bold>
≡ (
<italic>w</italic>
<sub>1</sub>
,
<italic>w</italic>
<sub>2</sub>
, …,
<italic>w</italic>
<sub>
<italic>d</italic>
</sub>
). For
<italic>D</italic>
 = 2, this is a straight line in variables
<bold>x</bold>
 = (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) with slope − 
<italic>w</italic>
<sub>1</sub>
/
<italic>w</italic>
<sub>2</sub>
and intercept − 
<italic>b</italic>
/
<italic>w</italic>
<sub>2</sub>
.</p>
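<p>The 4-mer example above can be made concrete with a short Python sketch. It counts overlapping 4-mers in a sequence to form the 256-component feature vector and evaluates f(x) = w · x + b for a given weight vector. The sequence, the weights, and the function names are hypothetical and chosen only for illustration.</p>
<preformat>
```python
from itertools import product

K = 4
# All 4**4 = 256 possible 4-mers in a fixed order; d indexes this list.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: d for d, kmer in enumerate(KMERS)}


def kmer_counts(sequence):
    """Return the 256-component vector of overlapping 4-mer counts."""
    x = [0] * len(KMERS)
    for start in range(len(sequence) - K + 1):
        d = INDEX.get(sequence[start:start + K])
        if d is not None:  # skip windows containing ambiguous bases such as N
            x[d] += 1
    return x


def decision_value(w, x, b):
    """f(x) = w . x + b; the sign gives the side of the hyperplane."""
    return sum(wd * xd for wd, xd in zip(w, x)) + b


# Hypothetical upstream sequence and a weight vector that only looks at ACGT.
x = kmer_counts("ACGTACGT")
w = [0.0] * len(KMERS)
w[INDEX["ACGT"]] = 1.0
f_value = decision_value(w, x, b=-1.0)
```
</preformat>
<p>Here the sequence contains ACGT twice, so f(x) = 2 − 1 = 1 and the gene would fall on the positive side of this toy hyperplane.</p>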
<p>Geometrically
<bold>w</bold>
is a vector perpendicular to the hyperplane
<italic>H</italic>
, the magnitude |
<italic>w</italic>
<sub>
<italic>d</italic>
</sub>
| of its
<italic>d</italic>
th component weighting the corresponding dimension. The function
<italic>f</italic>
(
<bold>x</bold>
) is assumed normalized (through scaling of
<bold>w</bold>
) so that the closest (positive, negative) pair
<bold>x</bold>
<sub arrange="stack">
<italic>i</italic>
</sub>
<sup arrange="stack">+</sup>
and
<bold>x</bold>
<sub arrange="stack">
<italic>i</italic>
</sub>
<sup arrange="stack">−</sup>
have values
<italic>f</italic>
(
<bold>x</bold>
<sup>+</sup>
) = 1 and
<italic>f</italic>
(
<bold>x</bold>
<sup>−</sup>
) =  − 1, respectively. Then the SVM problem is to find
<bold>w</bold>
and
<italic>b</italic>
such that the attribute vectors of all genes in the positive set are above the hyperplane
<italic>H</italic>
<sub>1</sub>
defined by
<disp-formula id="Equa">
<tex-math id="M2">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\bf w}\cdot {\bf x}+b=+1 $$\end{document}</tex-math>
</disp-formula>
and all in the negative set are below hyperplane
<italic>H</italic>
<sub>2</sub>
defined by
<disp-formula id="Equb">
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\bf w}\cdot {\bf x}+b=-1$$\end{document}</tex-math>
</disp-formula>
and that the
<italic>margin</italic>
(distance between
<italic>H</italic>
<sub>1</sub>
and
<italic>H</italic>
<sub>2</sub>
) is maximal. Thus the goal is to find a separator that maximizes the margin, or distance between the positive and negative classes. This construction is essentially a choice of scaling for
<bold>w</bold>
,
<italic>b</italic>
, in particular requiring that the length |
<bold>w</bold>
| be minimal, since this maximizes the margin under the above normalization. Maximizing the margin is a
<italic>convex optimization</italic>
problem which is generally solved using standard Lagrangian methods (Schölkopf and Smola
<xref ref-type="bibr" rid="CR73">2002</xref>
). Typically, as in our case, perfect separation cannot be achieved. When error-free decisions are not possible the method can be readily generalized to allow any specified amount of misclassification, with a suitable penalty function.</p>
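<p>Under the normalization above, the distance between the two bounding hyperplanes is 2/|w|, which is why minimizing |w| maximizes the margin. The sketch below computes this width, checks the margin constraint for a labelled point, and evaluates a hinge penalty for points that violate it. The hinge loss is one common choice of penalty and is an assumption here; the text does not specify which penalty function was used.</p>
<preformat>
```python
import math


def decision_value(w, b, x):
    """f(x) = w . x + b."""
    return sum(wd * xd for wd, xd in zip(w, x)) + b


def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1, i.e. 2 / |w|."""
    return 2.0 / math.sqrt(sum(wd * wd for wd in w))


def satisfies_margin(w, b, x, label):
    """True if label * f(x) >= 1: x lies on or beyond its own bounding plane."""
    return label * decision_value(w, b, x) >= 1.0


def hinge_loss(w, b, x, label):
    """Penalty for a margin violation (assumed form): max(0, 1 - label * f(x))."""
    return max(0.0, 1.0 - label * decision_value(w, b, x))


width = margin_width([3.0, 4.0])                        # |w| = 5
inside = satisfies_margin([1.0, 0.0], 0.0, [0.5, 0.0], 1)
penalty = hinge_loss([1.0, 0.0], 0.0, [0.5, 0.0], 1)
```
</preformat>
<p>A point with label +1 and f(x) = 0.5 is on the correct side of the separator but inside the margin, so it incurs a penalty of 0.5 under this assumed loss.</p>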
<p>An important aspect of the solution is that the data enter only in the form of a
<italic>kernel matrix K</italic>
, whose entries
<italic>K</italic>
<sub>
<italic>ij</italic>
</sub>
are dot products of all pairs
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
,
<bold>x</bold>
<sub>
<italic>j</italic>
</sub>
of feature vectors. In the case that all components of the feature vector are truly independent, the Lagrangian is a linear function of the elements of the kernel, and the linear dot product is used with
<italic>K</italic>
<sub>
<italic>ij</italic>
</sub>
 = 
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
·
<bold>x</bold>
<sub>
<italic>j</italic>
</sub>
. When the elements are correlated, the Lagrangian is written as a non-linear function of the inner products of the attribute vectors (see below). In particular, the non-linear dot products are defined for data points by
<italic>K</italic>
<sub>
<italic>ij</italic>
</sub>
 = 
<italic>K</italic>
(
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
,
<bold>x</bold>
<sub>
<italic>j</italic>
</sub>
), where the given positive definite function
<italic>K</italic>
(
<bold>x, y</bold>
) is known as the
<italic>kernel function</italic>
. Such non-linear products are equivalent to assuming that an unspecified higher dimensional feature space
<italic>F</italic>
<sub>1</sub>
exists into which
<italic>F</italic>
is mapped and in which the separating hyperplane is linear. This yields a Lagrangian with matrix entries given by this alternative dot product. The implicit choice of
<italic>F</italic>
<sub>1</sub>
is made by changing the type of inner product used (see Table 
<xref rid="Tab1" ref-type="table">1</xref>
). For a more detailed development of SVMs, see the excellent reference texts (Schölkopf and Smola
<xref ref-type="bibr" rid="CR73">2002</xref>
; Tan et al.
<xref ref-type="bibr" rid="CR78">2005</xref>
). For a detailed two-dimensional example see Holloway et al. (
<xref ref-type="bibr" rid="CR40">2006</xref>
).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Four common kernels tested</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Kernel</th>
<th align="left">Parameters</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Linear</td>
<td align="left">None</td>
<td align="left">
<italic>K</italic>
(
<bold>x, y</bold>
) = 
<bold>x</bold>
·
<bold>y</bold>
</td>
</tr>
<tr>
<td align="left">Polynomial</td>
<td align="left">Polynomial degree <italic>d</italic></td>
<td align="left">
<italic>K</italic>
(
<bold>x, y</bold>
) = (
<bold>x</bold>
·
<bold>y</bold>
 + 1)
<sup>
<italic>d</italic>
</sup>
</td>
</tr>
<tr>
<td align="left">Gaussian radial basis function (RBF)</td>
<td align="left">σ</td>
<td align="left">
<inline-formula id="IEq4">
<pmc-comment> Alternate image not processed: 11693_2006_9003_ArticleIEq4.gif </pmc-comment>
<tex-math id="M4">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K ({\bf x, y})=\exp \left( \frac{-\vert {\bf x}-{\bf y}\vert^2}{{\bf 2}\sigma ^2}\right)$$\end{document}</tex-math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">Gaussian</td>
<td align="left">σ</td>
<td align="left">
<inline-formula id="IEq5">
<pmc-comment> Alternate image not processed: 11693_2006_9003_ArticleIEq5.gif </pmc-comment>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K ({\bf x, y})=\frac{1}{{\bf 2}\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}}$$\end{document}</tex-math>
</inline-formula>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>These are the four common kernel functions, the parameters which must be set by the user, and their mathematical description</p>
</table-wrap-foot>
</table-wrap>
</p>
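<p>The first three kernels of Table 1 can be written out directly. The following is a minimal Python sketch, not the Matlab/Spider code used in this work; the function names, the toy points, and the parameter values are illustrative assumptions.</p>

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x . y + 1)^d"""
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(points, kernel, **params):
    """Return the matrix with entries K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q, **params) for q in points] for p in points]


# Toy feature vectors (hypothetical, for illustration only).
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K_lin = kernel_matrix(points, linear_kernel)
K_poly = kernel_matrix(points, polynomial_kernel, d=2)
K_rbf = kernel_matrix(points, rbf_kernel, sigma=1.0)
```

<p>Changing the kernel function changes only the entries of the kernel matrix; the rest of the SVM optimization is unaffected, which is what makes the choice of feature space implicit.</p>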
<p>Post-processing can be an essential task in pattern classification problems, particularly if one wishes to extract the highest quality predictions from a classifier. A naïve way to extract the most significant (positive) predictions from an SVM classifier is to select those data points which are most distant from the separator (signed distance proportional to
<bold>w</bold>
·
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
 + 
<italic>b</italic>
for data point
<italic>i</italic>
). The interpretation is that those distant points are most unlike the negative set and contain the strongest positive character. A more informative method is to rank data by
<italic>P</italic>
(
<italic>y</italic>
<sub>
<italic>i</italic>
</sub>
 = 1|
<bold>w</bold>
·
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
 + 
<italic>b</italic>
); i.e. by the posterior probability of a positive classification, given the distance of example
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
from the hyperplane. Platt observed that these posterior probabilities could be well approximated by fitting the SVM output to the form of a sigmoid function (Platt
<xref ref-type="bibr" rid="CR66">1999</xref>
), and developed a procedure to generate the best-fit sigmoid to an SVM output for any dataset. The result is the posterior probability
<italic>P</italic>
(
<italic>y</italic>
<sub>
<italic>i</italic>
</sub>
 = 1|
<bold>w</bold>
·
<bold>x</bold>
<sub>
<italic>i</italic>
</sub>
 + 
<italic>b</italic>
) for each data point in the training set (see Platt
<xref ref-type="bibr" rid="CR66">1999</xref>
 for further details). This probability places a confidence level on any new prediction made in the yeast genome and, most importantly, makes it possible to identify high-confidence predictions for future experiments.</p>
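<p>The sigmoid fit can be illustrated with a short Python sketch. It assumes the parametric form P(y = 1 | f) = 1 / (1 + exp(A f + B)), where f is the SVM output, and fits A and B by plain gradient descent on the negative log-likelihood. Platt's procedure uses smoothed targets and a more careful optimizer, so this is a simplified stand-in; the scores, labels, learning rate, and iteration count are hypothetical.</p>

```python
import math


def sigmoid_prob(f, A, B):
    """Posterior estimate P(y = 1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def fit_sigmoid(scores, labels, lr=0.1, n_iter=5000):
    """Fit A and B by gradient descent on the negative log-likelihood.

    labels are 1 (positive) or 0 (negative). Plain targets and plain
    gradient descent are simplifying assumptions of this sketch.
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = 0.0
        grad_B = 0.0
        for f, y in zip(scores, labels):
            p = sigmoid_prob(f, A, B)
            # With z = A*f + B, the derivative of the loss w.r.t. z is (y - p).
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Hypothetical SVM outputs with overlapping classes.
scores = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

A, B = fit_sigmoid(scores, labels)
posteriors = [sigmoid_prob(f, A, B) for f in scores]
# Rank examples from most to least confident positive prediction.
ranked = sorted(range(len(scores)), key=lambda i: posteriors[i], reverse=True)
```

<p>Because the fitted A is negative, the posterior increases monotonically with the SVM output, so the ranking matches the naïve distance ranking; the added value is a calibrated confidence on which a threshold can be set.</p>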
</sec>
</sec>
<sec id="Sec3" sec-type="methods">
<title>Methods</title>
<p>We have tested a variety of sequence and non-sequence based classifiers for predicting the association of TFs and genes. Altogether, 26 separate data sources (each yielding a feature map and kernel) are combined to build classifiers for each transcription factor. The 26 data sources comprise a family of sequence-based methods (e.g.,
<italic>k</italic>
-mer counts, TF motif conservation in multiple species, etc.), expression data sets, phylogenetic profiles, gene ontology (GO) functional profiles, and DNA structural information such as promoter melting temperature, DNA bending, and DNA accessibility predictions (see Table 
<xref rid="Tab2" ref-type="table">2</xref>
).
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Abbreviations of datasets used to generate classifiers</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"> </th>
<th align="left">Abbreviation</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">MOT</td>
<td align="left">Motif hits in
<italic>S. cerevisiae</italic>
</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">CON</td>
<td align="left">Motif hit conservation across 18 organisms</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">PHY</td>
<td align="left">Phylogenetic profile</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">EXP</td>
<td align="left">Expression correlation</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">GO</td>
<td align="left">GO term profile</td>
</tr>
<tr>
<td align="left">6</td>
<td align="left">KMER</td>
<td align="left">
<italic>K</italic>
-mers—4,5,6-mers</td>
</tr>
<tr>
<td align="left">7</td>
<td align="left">S1</td>
<td align="left">Split 6-mer 1 gap kkk_kkk</td>
</tr>
<tr>
<td align="left">8</td>
<td align="left">S2</td>
<td align="left">Split 6-mer 2 gaps kkk__kkk</td>
</tr>
<tr>
<td align="left">9</td>
<td align="left">S3</td>
<td align="left">Split 6-mer 3 gaps kkk___kkk</td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">S4</td>
<td align="left">Split 6-mer 4 gaps kkk____kkk</td>
</tr>
<tr>
<td align="left">11</td>
<td align="left">S5</td>
<td align="left">Split 6-mer 5 gaps kkk_____kkk</td>
</tr>
<tr>
<td align="left">12</td>
<td align="left">S6</td>
<td align="left">Split 6-mer 6 gaps kkk______kkk</td>
</tr>
<tr>
<td align="left">13</td>
<td align="left">S7</td>
<td align="left">Split 6-mer 7 gaps kkk_______kkk</td>
</tr>
<tr>
<td align="left">14</td>
<td align="left">S8</td>
<td align="left">Split 6-mer 8 gaps kkk________kkk</td>
</tr>
<tr>
<td align="left">15</td>
<td align="left">M01</td>
<td align="left">6-mer with 1 mismatch (count 0.1)</td>
</tr>
<tr>
<td align="left">16</td>
<td align="left">M05</td>
<td align="left">6-mer with 1 mismatch (count 0.5)</td>
</tr>
<tr>
<td align="left">17</td>
<td align="left">ENT</td>
<td align="left">Condition specific TF–target correlation</td>
</tr>
<tr>
<td align="left">18</td>
<td align="left">BIT</td>
<td align="left">Nucleotide sparse binary encoding</td>
</tr>
<tr>
<td align="left">19</td>
<td align="left">CRV</td>
<td align="left">Promoter curvature prediction</td>
</tr>
<tr>
<td align="left">20</td>
<td align="left">HC</td>
<td align="left">Homolog conservation</td>
</tr>
<tr>
<td align="left">21</td>
<td align="left">HYD</td>
<td align="left">Hydroxyl cleavage</td>
</tr>
<tr>
<td align="left">22</td>
<td align="left">KPo</td>
<td align="left">Kmer median positions from start</td>
</tr>
<tr>
<td align="left">23</td>
<td align="left">KPr</td>
<td align="left">Kmer Probabilities (− log
<italic>p</italic>
val)</td>
</tr>
<tr>
<td align="left">24</td>
<td align="left">MT</td>
<td align="left">Promoter melting temperature − 20 bp window</td>
</tr>
<tr>
<td align="left">25</td>
<td align="left">DG</td>
<td align="left">Promoter melting Delta G profile − 20 bp win</td>
</tr>
<tr>
<td align="left">26</td>
<td align="left">BND</td>
<td align="left">Promoter bend prediction</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Abbreviations for each dataset and a short description are given</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Our positive and negative training sets are taken from ChIP–chip experiments (Harbison et al.
<xref ref-type="bibr" rid="CR33">2004</xref>
; Lee et al.
<xref ref-type="bibr" rid="CR52">2002</xref>
), Transfac 6.0 Public (Matys et al.
<xref ref-type="bibr" rid="CR59">2005</xref>
), and a list curated by Young et al. from which we have excluded indirect evidence such as sequence analysis and expression correlation (Young Lab Web Data,
<ext-link ext-link-type="uri" xlink:href="http://www.staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=evidence">http://www.staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=evidence</ext-link>
). Only ChIP–chip interactions of
<italic>p</italic>
-value ≤ 10
<sup>−3</sup>
(i.e., a high confidence level) are considered positives (Harbison et al.
<xref ref-type="bibr" rid="CR33">2004</xref>
). The Transfac and curated list represent a manually annotated set which will later be used separately during SVM comparison to PSSM performance. For the purposes of SVM, however, all manually curated and high-throughput sets are grouped together, making a total of 9,104 positive interactions.</p>
<p>Negative sets pose a greater challenge since no defined negatives exist in the literature; however, since a particular TF will regulate only a small fraction of the genome, a random choice of negatives seems acceptable. In fact, test cases with a few TFs show good classification performance with random negatives (unpublished work). Nevertheless, a safer set of negatives would be those showing no binding by experiment under some set of conditions. Along those lines, we have chosen for each TF 175 genes with the highest
<italic>p</italic>
-values (generally > 0.8) under all conditions tested in genomic ChIP–chip analyses (Harbison et al.
<xref ref-type="bibr" rid="CR33">2004</xref>
; Lee et al.
<xref ref-type="bibr" rid="CR52">2002</xref>
). Clearly, not all experimental conditions have been sampled, so this does not guarantee that the chosen genes are never bound by the TF, but this choice of negatives should maximize our chances of selecting genes not regulated by the TF of interest.</p>
<p>All promoter sequences have been collected from RSA tools (van Helden
<xref ref-type="bibr" rid="CR35">2003</xref>
), Ensembl (Birney et al.
<xref ref-type="bibr" rid="CR11">2006</xref>
), or the Broad Institute’s Fungal Genome Initiative (Galagan et al.
<xref ref-type="bibr" rid="CR27">2003</xref>
; Dean
<xref ref-type="bibr" rid="CR21">2005</xref>
). For yeast, promoters are defined as the 800 bp upstream of the coding sequence. The motif hit conservation dataset required promoter regions from 17 other genomes. Those genomes, their sources, and the length of the promoter regions are described in our previous report (Holloway et al.
<xref ref-type="bibr" rid="CR40">2006</xref>
). Sequences are masked using the dust algorithm and the RepeatMasker software (Tatusov and Lipman
<xref ref-type="bibr" rid="CR79">2005</xref>
; Smit et al.
<xref ref-type="bibr" rid="CR75">2005</xref>
) where appropriate, to exclude low complexity sequences and known repeat DNA from further analysis. PSSM scans (for datasets 1 and 2, below) are performed with the MotifScanner algorithm (Aerts et al.
<xref ref-type="bibr" rid="CR2">2003</xref>
). MotifScanner assumes a sequence model where regulatory elements are distributed within a noisy background sequence (Aerts et al.
<xref ref-type="bibr" rid="CR2">2003</xref>
). The algorithm requires input of a background sequence model, which in this case is a transition matrix of a third order Markov model generated from the masked upstream regions of each genome. MotifScanner requires the user to set only one parameter, the threshold score for accepting a motif as a binding site. Several thresholds have been tested; the results used to create SVM kernels all come from a setting of 0.15, which has been found to be a reasonable middle ground, making approximately 560 predictions per TF. Settings beyond 0.2 produce too many false hits to be useful. The PSSMs themselves are obtained from Transfac 6.0 Public and from Harbison et al. (
<xref ref-type="bibr" rid="CR32">2005</xref>
), which are a mix of experimentally derived motifs and those generated by motif-discovery procedures.</p>
<p>Datasets using
<italic>k</italic>
-mers rather than PSSMs are generated using the fasta2matrix (Pavlidis et al.
<xref ref-type="bibr" rid="CR64">2004</xref>
) program which lists all possible
<italic>k</italic>
-mers and counts the occurrence of each within a set of promoters. Gapped
<italic>k</italic>
-mers are detected using custom scripts written as Matlab m-files. The expression data used include 1011 microarray experiments compiled by Ihmels and co-workers, which can be downloaded with permission from the authors (Ihmels et al.
<xref ref-type="bibr" rid="CR44">2005</xref>
).</p>
<p>Each data set is normalized so that each feature in the training set has a mean of 0 and a standard deviation of 1. Gene Ontology, phylogenetic profile, and TF–target correlation data are not normalized since their data are binary. Finally, since the ultimate goal is data integration, the number of training examples for a given TF must be the same for every dataset used to make a classifier. When examples are missing from a dataset, as is the case with the GO and COG (phylogenetic profiles based on the Clusters of Orthologous Groups database) based classifiers, random values sampled from the rest of the training set are used to fill in the missing vectors.</p>
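<p>The per-feature normalization can be sketched as follows. This is an illustrative Python version, not the code used in this work; the function name, the use of the population standard deviation, the handling of constant features, and the toy counts are all assumptions.</p>

```python
import math


def zscore_columns(rows):
    """Scale each feature (column) to mean 0 and standard deviation 1."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # Constant features get a divisor of 1 so they map to 0, not NaN.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(r[j] - means[j]) / stds[j] for j in range(n_features)]
            for r in rows]


# Hypothetical k-mer counts: three promoters (rows) by three features.
counts = [[2.0, 10.0, 5.0],
          [4.0, 20.0, 5.0],
          [6.0, 30.0, 5.0]]
z = zscore_columns(counts)
```

<p>After scaling, features with very different raw ranges (the first two columns above) become directly comparable, which matters for distance-based kernels such as the RBF.</p>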
<p>All classifier construction and validation was performed in Matlab (The Mathworks:
<ext-link ext-link-type="uri" xlink:href="http://www.mathworks.com/">http://www.mathworks.com/</ext-link>
) using the Spider machine learning library (Weston et al.
<xref ref-type="bibr" rid="CR85">2005</xref>
). Mapping of predicted binding targets to biological pathways was done using the Pathway Tools Omics Viewer at SGD (Christie et al.
<xref ref-type="bibr" rid="CR16">2004</xref>
). See our supplementary methods section for an expanded description of the analyses below.</p>
<sec id="Sec4">
<title>Description of analysis</title>
<p>A separate classifier is developed for each TF based on each independent dataset. The four kernel functions in Table 
<xref rid="Tab1" ref-type="table">1</xref>
(linear, RBF, Gaussian, and polynomial) are tested using leave-one-out cross validation, and the function with the highest
<italic>F</italic>
<sub>1</sub>
score (below) is chosen as best for that particular TF–dataset combination. A flow diagram of our method can be seen in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. Let TP denote the count of true positives, FN false negatives, etc. The
<italic>F</italic>
<sub>1</sub>
statistic is a robust measure equal to the harmonic mean of sensitivity (
<italic>S</italic>
) and positive predictive value (PPV). It is defined by
<disp-formula id="Equc">
<tex-math id="M6">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ F_1=\frac{2\times S\times \hbox{PPV}}{S+\hbox{PPV}}=\frac{2\times \hbox{TP}}{2\times\hbox{TP}+\hbox{FP}+\hbox{FN}}$$\end{document}</tex-math>
</disp-formula>
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Flow diagram: synthesizing a single classifier for each TF from several data sets. A classifier is constructed for each individual TF for each genomic dataset, using every one of four possible kernel functions (26 datasets × 104 TFs × 4 kernel functions = 10,816 kernels from which SVM classifiers are built). For each of these classifiers optimal parameters are chosen by cross-validation. For each dataset and each TF, the best performing of the four kernel functions is selected, reducing the number of classifiers to 2,704 (26 datasets × 104 TFs). Finally, the datasets are combined based on
<italic>F</italic>
<sub>1</sub>
score of their best performing kernel so that there is only one classifier per TF</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig1_HTML" id="MO9"></graphic>
</fig>
</p>
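<p>As a worked illustration of the F1 definition and of kernel selection, the following Python sketch computes the statistic both ways and picks the best of four kernels. The confusion-matrix counts are hypothetical and the function names are assumptions.</p>

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of sensitivity S and positive predictive value PPV."""
    if tp == 0:
        return 0.0
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def f1_from_counts(tp, fp, fn):
    """Equivalent closed form: 2 TP / (2 TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (TP, FP, FN) counts from cross validation of one
# TF-dataset pair under each of the four kernels.
kernel_counts = {
    "linear": (60, 20, 40),
    "polynomial": (55, 15, 45),
    "rbf": (70, 30, 30),
    "gaussian": (40, 10, 60),
}
kernel_f1 = {name: f1_score(*c) for name, c in kernel_counts.items()}
best_kernel = max(kernel_f1, key=kernel_f1.get)
```

<p>With these counts the RBF kernel is selected, with F1 = 140/200 = 0.7.</p>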
<p>If we choose the classifier with the best
<italic>F</italic>
<sub>1</sub>
statistic, each TF now has one classifier for each type of genomic data (26 classifiers total). For every classifier the
<italic>C</italic>
parameter (the trade-off between training error and margin) must be specified, and some kernel functions require a second parameter, e.g., the polynomial degree
<italic>d</italic>
for a polynomial kernel or a standard deviation σ (which controls the scaling of data in the feature space) for a Gaussian or radial basis function (RBF) kernel. The values for these parameters are chosen by a grid-selection procedure in which many values are tested over a specified range using 5-fold cross validation. The ROC score is used to choose the best values. As an example for an RBF kernel a range of
<italic>C</italic>
values from 2
<sup>−5</sup>
to 200 is tested with a range of σ values from 2
<sup>−15</sup>
to 2
<sup>3</sup>
. The best combination of values is then chosen to make the final classifier.</p>
<p>The performance of any parameter-optimized classifier is determined using leave-one-out cross validation. Once the best kernel function
<italic>K</italic>
(
<bold>x</bold>
,
<bold>y</bold>
) (with optimized parameter values) has been chosen for a particular TF–dataset pair, the next step is to combine the datasets to create a composite classifier. To that end, the
<italic>K</italic>
(
<bold>x</bold>
,
<bold>y</bold>
) is used to create a kernel matrix for each of the 26 datasets. Before weighting and combining kernels for each data set, all kernel matrices are normalized according to
<disp-formula id="Equd">
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \tilde {K}(x,y)=\frac{K(x,y)}{\sqrt {K(x,x)K(y,y)}}.$$\end{document}</tex-math>
</disp-formula>
</p>
<p>This normalization effectively adjusts all points to lie on a unit hypersphere in the feature space
<italic>F</italic>
, and the diagonal elements in every kernel matrix will be 1. This ensures that no single kernel has matrix values that are comparatively larger or smaller than those of other kernels, so all matrices initially make the same contribution to the combination.</p>
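<p>The normalization is a one-line operation on a kernel matrix. A minimal Python sketch, assuming a kernel matrix with strictly positive diagonal entries (the function name and the 2 × 2 example are illustrative):</p>

```python
import math


def normalize_kernel(K):
    """Return K~ with K~[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


# Hypothetical unnormalized kernel matrix for two examples.
K = [[4.0, 2.0],
     [2.0, 9.0]]
K_norm = normalize_kernel(K)
```

<p>Here the off-diagonal entry becomes 2 / sqrt(4 × 9) = 1/3 and both diagonal entries become 1, so the normalized value can be read as the cosine of the angle between the two points in the feature space.</p>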
<p>Datasets can be combined by adding kernel matrices together; however, an unweighted linear combination ignores dataset dependent performance—in fact some datasets do not perform better than random for some TFs. To avoid this problem, we determine whether the number of true positives predicted using a particular dataset is significantly different (
<italic>p</italic>
 ≤ 0.05) from what would be achieved by random guessing. We calculate the probability that the number of true positives obtained at random,
<italic>g</italic>
, equals or exceeds the observed number
<italic>x</italic>
, given the training set size
<italic>N</italic>
, the total number of known positives
<italic>L</italic>
(i.e., TP + FN), and the number of positively classified examples,
<italic>M</italic>
(i.e., TP + FP)
<disp-formula id="Eque">
<tex-math id="M8">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \begin{aligned} p=&P(g\geq x)=1-F(x-1\vert N,L,M)=1-\sum\limits_{i=0}^{x-1} {\frac{\left({{\begin{array}{l} L\\i\\ \end{array}}} \right)\left( {{\begin{array}{l} {N-L}\\ {M-i}\\\end{array}}} \right)}{\left( {{\begin{array}{l} N\\ M\\ \end{array}}} \right)}}\hbox{ for } x > 0; \\ p=&1\hbox{ otherwise}. \end{aligned} $$\end{document}</tex-math>
</disp-formula>
Here
<italic>p</italic>
is the probability of drawing
<italic>x</italic>
or more true positives at random. Datasets that do not meet the
<italic>p</italic>
-value cutoff are eliminated from the analysis for a particular TF.</p>
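<p>The hypergeometric tail probability above can be evaluated exactly with integer binomial coefficients. The following Python sketch uses the same symbols N, L, M, and x; the function name and the numerical examples are assumptions made for illustration.</p>

```python
from math import comb


def enrichment_p_value(x, N, L, M):
    """P(g >= x) when M examples are drawn at random from N, of which L are positive.

    Returns 1 when x is not positive, as in the definition above.
    """
    if not x > 0:
        return 1.0
    total = comb(N, M)
    # Sum the probabilities of drawing i = 0 .. x-1 true positives.
    upper = min(x, M + 1)
    cdf = sum(comb(L, i) * comb(N - L, M - i) for i in range(upper)) / total
    return 1.0 - cdf


# Hypothetical example: 20 training examples, 5 known positives,
# 5 examples classified as positive, 4 of them correct.
p = enrichment_p_value(4, N=20, L=5, M=5)
keep_dataset = not p > 0.05
```

<p>For this example p = (75 + 1) / 15504, roughly 0.0049, so the dataset would be retained for the TF in question.</p>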
<p>Finally, the significant datasets (each represented by a kernel matrix
<italic>K</italic>
<sub>
<italic>ij</italic>
</sub>
) must be weighted based on their performance. Using a scheme (described below) with weights equal to the
<italic>F</italic>
<sub>1</sub>
score of each classifier, the underlying 26 kernel matrices are scaled and added into a single unified kernel corresponding to the given transcription factor. Once the weighting is complete, an overall leave-one-out cross-validation is employed to estimate the error of the combined classifier. Although individual kernels were tuned on the entire set of examples for each dataset independently, the
<italic>C</italic>
parameter of the final, combined SVM was determined only on the training set during cross-validation. Nevertheless, to measure the danger of overfitting the most useful performance benchmark is perhaps the random data controls shown in Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
. Also, the use of Platt’s posterior probabilities as a post-processing filter can help in choosing the truly relevant targets once the procedure is applied to the entire genome. As further validation we employed an alternative scheme for data combination on a few test cases. The feature vectors for several datasets were directly concatenated and recursive feature elimination (Guyon et al.
<xref ref-type="bibr" rid="CR31">2002</xref>
) was applied to select the most relevant features for classifier construction completely independent of test data. This is a more computationally intensive procedure requiring many datasets to be loaded into memory simultaneously and hundreds of SVMs to be fit iteratively in order to weight data features. The results for these tests appeared similar to the results obtained by the procedures outlined in this manuscript, and we will describe these results on a larger set of transcription factors in a future publication.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>SVM performance. Performance of each dataset and combined datasets ordered by increasing
<italic>F</italic>
<sub>1</sub>
score. Cumulative results for all transcription factors were used to plot the sensitivity, positive-predictive-value, and the
<italic>F</italic>
<sub>1</sub>
statistic for each dataset and data combination. Dataset abbreviations are given in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. The combined classifiers, labeled 26st (linear weighting), 26sq (square weighting), and 26t (tangent square weighting) on the far right, perform better than any dataset alone, with the squared tangent weighting giving the best result overall. Three random datasets also appear in the figure: R (randomized
<italic>k</italic>
-mer counts), RH (randomized 10% selection of each dataset), and RN (normally distributed random numbers)</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig2_HTML" id="MO10"></graphic>
</fig>
</p>
<p>Three simple weighting schemes have been compared. In all cases the primary weight for a method is determined by computing its ratio with the best performing method. Our first weighting scheme is linear and simply multiplies the
<italic>m</italic>
th matrix
<italic>K</italic>
<sup>
<italic>m</italic>
</sup>
 = 
<italic>K</italic>
<sub arrange="stack">
<italic>ij</italic>
</sub>
<sup arrange="stack">
<italic>m</italic>
</sup>
by its scaled
<italic>F</italic>
<sub>1</sub>
score α
<sub>
<italic>m</italic>
</sub>
and computes a sum, yielding
<inline-formula id="IEq1">
<pmc-comment> Alternate image not processed: 11693_2006_9003_ArticleIEq1.gif </pmc-comment>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=\sum\limits_{m=1}^{26}{\alpha_m K^m}$$\end{document}</tex-math>
</inline-formula>
. A second scheme is non-linear and squares the weights of the first method before multiplying, yielding
<inline-formula id="IEq2">
<pmc-comment> Alternate image not processed: 11693_2006_9003_ArticleIEq2.gif </pmc-comment>
<tex-math id="M10">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=\sum\limits_{m=1}^{26} {\alpha _m^2 K^m}$$\end{document}</tex-math>
</inline-formula>
. This will not change the weight of the best performing method, which will be scaled to 1, but will decrease the relative weights of poorer methods. Our third scheme, which is the most non-linear, takes the squared tangent (an effective sigmoidal function) of the primary weight, yielding
<inline-formula id="IEq3">
<pmc-comment> Alternate image not processed: 11693_2006_9003_ArticleIEq3.gif </pmc-comment>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=\sum\limits_{m=1}^{26} {({\tan ^2\alpha_m})K^m}$$\end{document}</tex-math>
</inline-formula>
. This more steeply penalizes poorly performing methods while increasing relative weights of the best methods (e.g., instead of weight 1, the best method will have a weight of 2.43).</p>
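<p>The three weighting schemes differ only in the transform applied to the scaled F1 score before the kernel matrices are summed. A Python sketch under stated assumptions follows: the scheme labels, the two toy kernel matrices, and the F1 scores are hypothetical, and only two kernels are combined to keep the example small.</p>

```python
import math


def scaled_weights(f1_scores):
    """Primary weights: each F1 score divided by the best F1 score."""
    best = max(f1_scores)
    return [f / best for f in f1_scores]


def combine_kernels(kernels, f1_scores, scheme="linear"):
    """Weighted sum of kernel matrices under one of three schemes."""
    transforms = {
        "linear": lambda a: a,
        "square": lambda a: a ** 2,
        "tan2": lambda a: math.tan(a) ** 2,
    }
    weights = [transforms[scheme](a) for a in scaled_weights(f1_scores)]
    n = len(kernels[0])
    combined = [[sum(w * K[i][j] for w, K in zip(weights, kernels))
                 for j in range(n)] for i in range(n)]
    return combined, weights


# Two hypothetical normalized kernel matrices and their F1 scores.
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.6], [0.6, 1.0]]
f1_scores = [0.8, 0.4]

K_lin, w_lin = combine_kernels([K1, K2], f1_scores, "linear")
K_sq, w_sq = combine_kernels([K1, K2], f1_scores, "square")
K_tan, w_tan = combine_kernels([K1, K2], f1_scores, "tan2")
```

<p>In this example the weaker dataset receives 0.5, 0.25, and about 0.12 of the weight of the best dataset under the linear, squared, and squared tangent schemes respectively, while the best dataset's own weight is 1, 1, and tan<sup>2</sup>(1), about 2.43.</p>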
</sec>
<sec id="Sec5">
<title>Genomic datasets</title>
<sec id="Sec6">
<title>1 PSSM motif counts (MOT, Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 1)</title>
<p>Position-specific scoring matrices (PSSMs) for 104 transcription factors have been used to scan 800 bp promoters in
<italic>S. cerevisiae</italic>
for each gene in a training set, and the number of hits for each PSSM has been counted. These counts are the features (i.e., components) of 104 dimensional feature vectors. It is clear that a greater number of “hits” by a PSSM in the upstream region of a gene will imply a greater likelihood that the TF corresponding to the matrix will actually bind the gene. For each prediction there is a probability that it will be true,
<italic>P</italic>
(True|hit). If a certain upstream region of a gene has more than one hit, the probability that the TF binds to the gene will increase (Supplementary Figure 1). This method aims to better predict TF binding by taking into account the number and types of binding motifs in a promoter.</p>
</sec>
<sec id="Sec7">
<title>2 PSSM hit conservation (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 2)</title>
<p>Conservation of a TF binding site is determined by counting hits of the TF probability matrix (PSSM) in orthologous upstream regions from several organisms. Orthology information was taken from the Homologene database (Wheeler et al.
<xref ref-type="bibr" rid="CR86">2005</xref>
) for all organisms except the sensu stricto and sensu lato yeasts, for which it was obtained from Washington University and the Whitehead Broad Institute (Cliften et al.
<xref ref-type="bibr" rid="CR17">2003a</xref>
,
<xref ref-type="bibr" rid="CR18">b</xref>
; Kellis et al.
<xref ref-type="bibr" rid="CR49">2003</xref>
; Kellis
<xref ref-type="bibr" rid="CR48">2003</xref>
).</p>
<p>In this analysis, a hit by a PSSM in the upstream region of an ortholog is defined as a conserved motif. In this way, conservation of a
<italic>potential binding site</italic>
is being measured rather than the exact nucleotide string. This is because a PSSM may identify sequences that are different in nucleotide composition but still match the probability matrix. This is a loose conservation criterion that makes sense biologically, since natural selection will act to preserve a binding site, and not necessarily an exact nucleotide string.</p>
<p>The stronger the conservation of a potential binding site, the more likely the site is to be real (See Supplementary Figure 2). These data are assembled into a 104 dimensional feature vector for each gene in yeast. Each feature represents a transcription factor motif and the value of the attribute is the number of genomes in which the binding site is conserved.</p>
</sec>
<sec id="Sec8">
<title>3 Kmers, mismatch kmers, and gapped kmers (Table 
<xref rid="Tab2" ref-type="table">2</xref>
 items 6–16)</title>
<p>PSSMs may fail to detect binding sites if the binding site collection used to generate them is incomplete (in the case of experimental data) or if the motif discovery procedure is inaccurate (as may occur in the case of computationally generated matrices). In this case, the distribution of all
<italic>k</italic>
-mers in a gene’s promoter may be used to predict whether it is bound or not-bound by a TF.
<italic>K</italic>
-mer counts in promoters have been used previously with SVMs to predict genes’ functions (Pavlidis and Noble
<xref ref-type="bibr" rid="CR63">2001</xref>
). Here, several strategies are used to generate a variety of datasets based on
<italic>k</italic>
-mer strings. First, one dataset of feature vectors is created by decomposing all yeast promoters into counts of all
<italic>k</italic>
-mers of length 4, 5, and 6. Similarly, 6-mers with variable length center gaps (of the form
<italic>kkk</italic>
 − {
<italic>x</italic>
}
<sub>
<italic>n</italic>
</sub>
 − 
<italic>kkk</italic>
) are counted in each promoter to form sequence datasets allowing gaps of size 1–8 (Table 
<xref rid="Tab2" ref-type="table">2</xref>
, items 7–14). This allows detection of split motifs such as the binding site for Abf1, RTCRYNNNNNACGR. Finally, we construct two datasets with 6-mer counts allowing one mismatch in any 6-mer (Table 
<xref rid="Tab2" ref-type="table">2</xref>
 items 15–16). A mismatched base pair is counted with a value of 0.1 in the first dataset, and 0.5 in the second.</p>
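<p>Gapped and mismatch counting can be sketched in a few lines of Python. This is not the authors' Matlab code. In particular, the mismatch rule below is one possible reading of the description: each observed 6-mer adds 1 to its own count and the mismatch value to every 6-mer that differs from it at exactly one position. Function names and test sequences are hypothetical, and reverse complements are ignored here.</p>

```python
from collections import Counter


def count_gapped_kmers(seq, gap, half=3):
    """Count split 6-mers: 3 bases, then `gap` wildcard positions, then 3 bases."""
    counts = Counter()
    span = 2 * half + gap
    for start in range(len(seq) - span + 1):
        left = seq[start:start + half]
        right = seq[start + half + gap:start + span]
        # Wildcard positions are written as underscores, e.g. ACG__ACG.
        counts[left + "_" * gap + right] += 1
    return counts


def count_mismatch_kmers(seq, k=6, mismatch_value=0.1, alphabet="ACGT"):
    """Exact k-mers count 1; each one-mismatch neighbour gets mismatch_value."""
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        counts[kmer] += 1.0
        for pos in range(k):
            for base in alphabet:
                if base != kmer[pos]:
                    neighbour = kmer[:pos] + base + kmer[pos + 1:]
                    counts[neighbour] += mismatch_value
    return counts


gapped = count_gapped_kmers("ACGTTACG", gap=2)
mismatch = count_mismatch_kmers("AAAAAAA", k=6, mismatch_value=0.1)
```

<p>Running the gapped counter with gap sizes 1 through 8 would give one feature set per gap length, corresponding to datasets S1 to S8 in Table 2.</p>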
<p>Given a set of true positives and true negatives for each TF, the SVM classifies genes based on their complete promoter content as represented by these
<italic>k</italic>
-mer distributions. As we point out in the “Discussion” section,
<italic>k</italic>
-mer counts are the single best performing method for distinguishing transcription factor targets.</p>
<p>It should be noted that our sequence based kernels are very similar to sequence kernels used in previous work. Specifically, our kernels are inspired by the spectrum kernel (Leslie et al.
<xref ref-type="bibr" rid="CR54">2002</xref>
), the (
<italic>g</italic>
,
<italic>k</italic>
)-gappy kernel (Leslie and Kuang
<xref ref-type="bibr" rid="CR53">2003</xref>
) and the mismatch kernel (Leslie et al.
<xref ref-type="bibr" rid="CR55">2004</xref>
) which have been proposed for sequence classification (see Supplementary Methods for a more complete description). Finally, the kernels used here take into account the reverse complements of each
<italic>k</italic>
-mer. This means, for instance, that the 3-mers “AAA” and “TTT” are counted together as one unit since the presence of one necessitates the other on the opposite strand of DNA.</p>
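<p>Counting a k-mer together with its reverse complement can be implemented by mapping each k-mer to a single representative. In the Python sketch below the representative is the alphabetically smaller of the two strings, which is a common convention and an assumption here rather than the documented behaviour of the tools used in this work; the function names are likewise illustrative.</p>

```python
from collections import Counter

# Translation table for complementing DNA bases; other symbols
# (for example N from masking) are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the sequence read on the opposite strand."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Single representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(seq, ks=(4, 5, 6)):
    """Count all k-mers of the given lengths, merging reverse complements."""
    counts = Counter()
    for k in ks:
        for start in range(len(seq) - k + 1):
            counts[canonical(seq[start:start + k])] += 1
    return counts


example = count_canonical_kmers("AAATTT", ks=(3,))
```

<p>For the sequence AAATTT the four 3-mers AAA, AAT, ATT, and TTT collapse into two features, AAA and AAT, each with a count of 2.</p>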
</sec>
<sec id="Sec9">
<title>4 GO annotation (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 5)</title>
<p>GO term annotation can be used to detect possible transcriptional targets. The targets of a transcription factor have often been shown to have similar function and a gene’s GO annotation can be used to measure its functional similarity to known targets (Allocco et al.
<xref ref-type="bibr" rid="CR3">2004</xref>
). For this method, all GO Biological Process terms in yeast become features for genes, such that every gene will have a binary vector, with a 1 for the terms which are annotated to it, and 0 otherwise. Parent terms of direct annotations also receive a 1. There are 2,155 possible terms for yeast, giving a vector of the same length. Since only about one-third of yeast genes are annotated with GO terms, a feature matrix generated with GO data is sparse, consisting mostly of zeros. Imputing zeros for genes unannotated in GO can potentially bias the result of the classifier (for instance, if many negatives are missing and hence are described using zero vectors it may be trivial to separate these from the positives). Instead, the binary vector is filled in with random data according to the background distribution of term annotation in the yeast genome. Despite using random data, the vectors are still sparse and the best 800 GO terms are selected using the Fisher score criterion during the classifier construction for each TF. The Fisher criterion gives high scores to features that have large differences in mean between the positive and negative classes in relation to variance. This feature selection is performed in the SPIDER data mining package (Bishop
<xref ref-type="bibr" rid="CR12">1995</xref>
).</p>
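<p>As an illustration of Fisher score ranking, the following sketch scores each feature by the squared difference of class means divided by the sum of class variances. This is one common form of the criterion and is an assumption here; the exact formula implemented in SPIDER may differ, and the function names are hypothetical.</p>

```python
from statistics import mean, pvariance

def fisher_scores(positives, negatives):
    """Score each feature as (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    n_features = len(positives[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row in positives]
        neg = [row[j] for row in negatives]
        gap = (mean(pos) - mean(neg)) ** 2
        spread = pvariance(pos) + pvariance(neg)
        # A small constant avoids division by zero for constant features.
        scores.append(gap / (spread + 1e-12))
    return scores

def top_features(positives, negatives, n_keep):
    """Return the indices of the n_keep highest scoring features."""
    scores = fisher_scores(positives, negatives)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n_keep]
```

<p>Calling <monospace>top_features(positives, negatives, 800)</monospace> on the binary GO vectors would correspond to the selection of 800 terms described above.</p>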
</sec>
<sec id="Sec10">
<title>5 Phylogenetic profiles (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 3)</title>
<p>Co-evolution of a transcription factor’s targets may indicate regulation. A phylogenetic profile of a gene is simply the pattern of occurrence of its orthologs across a set of genomes. Genes with similar patterns have been shown to participate in the same physical complexes or have similar biochemical roles within the cell (Wu et al.
<xref ref-type="bibr" rid="CR88">2003</xref>
). It has also been postulated that transcription factors and their targets co-evolve (Gasch et al.
<xref ref-type="bibr" rid="CR29">2004</xref>
). Therefore it seems reasonable that a group of commonly regulated genes could share a similar pattern of inheritance. Phylogenetic profiles here were parsed from the COG database, which contains orthology information between
<italic>S. cerevisiae</italic>
and 65 other microbial genomes. Each gene in the positive and negative sets is represented by a 65-component binary vector, one component for each genome in COG, with a component set to 1 if the gene’s ortholog is present in the corresponding genome and 0 otherwise. As with the GO data, many genes have not been assigned to COG groups, so random vectors are generated for the missing genes as described for GO annotation above.</p>
</sec>
<sec id="Sec11">
<title>6 TF–target expression correlation as a method to predict regulation</title>
<p>Analysis of transcription factor motif-matching outputs shows that false positive predictions are numerous even in cases of low sensitivity. Expression analysis provides a means to discover targets missed by sequence-based methods. Several studies have shown that genes with similar expression patterns are likely to share similar regulation and, conversely, genes regulated by the same TF are more likely to be co-expressed (Allocco et al.
<xref ref-type="bibr" rid="CR3">2004</xref>
; Yu et al.
<xref ref-type="bibr" rid="CR90">2003</xref>
).</p>
<p>Two strategies are often useful for discovering transcription factor targets using expression data. Often genes are turned on and off as the expression levels of their controlling TFs are altered. Thus one method is to find targets of some TFs by finding TF/gene pairs that have correlated expression patterns (Zhu et al.
<xref ref-type="bibr" rid="CR92">2002</xref>
). A second approach involves identifying groups of co-expressed genes, and hypothesizing that this co-expression is due to co-regulation by the same TF(s) (Ihmels et al.
<xref ref-type="bibr" rid="CR43">2002</xref>
,
<xref ref-type="bibr" rid="CR45">2004</xref>
). In the two sub-sections below, we describe how each of these strategies can be used to construct data vectors for SVM learning.</p>
<p>
<italic>6.1 TF–target correlations measured by profile entropy minimization</italic>
(Table 
<xref rid="Tab2" ref-type="table">2</xref>
<italic>item 17</italic>
)</p>
<p>The approach described in (Mellor and DeLisi
<xref ref-type="bibr" rid="CR60">2004</xref>
) addresses the problem of discovering condition specific regulation by searching for the conditions under which a regulator’s profile is maximally associated with a target’s profile, for example, when the TF and target have correlated expression. This essentially chooses the set of experiments where the TF most clearly and significantly controls the expression of a potential target. In this analysis correlations with a
<italic>p</italic>
-value of 10
<sup>−10</sup>
are chosen in order to extract the most significant regulatory relationships and reduce false predictions. Significant relationships are coded as 1’s in a gene’s feature vector, so that every gene is described by a binary vector whose length is the number of TFs (104 in this case).</p>
<p>
<italic>6.2 Target–target correlations</italic>
(Table 
<xref rid="Tab2" ref-type="table">2</xref>
<italic>item 4</italic>
)</p>
<p>For purposes of representing expression correlation between targets, we use normalized log2 ratios for each gene across 1,011 experiments (Bergman et al.
<xref ref-type="bibr" rid="CR9">2003</xref>
). Each gene’s expression profile is normalized to a mean of 0 and standard deviation of 1. This expression profile is then the vector of features used by the SVM to represent any example gene (each gene will have 1,011 features). In this case, the dot product between such gene vectors is analogous to a Pearson correlation and naturally fits into the SVM framework. Given many known targets of a transcription factor as positive cases, the SVM can identify a new target based on how closely its expression resembles that of the known examples.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>High ranking
<italic>k</italic>
-mer alignment and comparison to known binding site</p>
</caption>
<graphic xlink:href="11693_2006_9003_Tab3_HTML" id="MO19"></graphic>
<table-wrap-foot>
<p>Weight vectors for each TF classifier are used to rank all
<italic>k</italic>
-mers. Known TF motifs appear in the middle column and high ranking
<italic>k</italic>
-mers are assembled in the right column showing correspondence with the known motif. Standard nucleotide abbreviations are used. Some less common abbreviations are W = {A or T}, R = Purine, Y = Pyrimidine, S = {C or G}, K = {T or G}, M = {C or A}, D = not C</p>
</table-wrap-foot>
</table-wrap>
</p>
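<p>The relation between the dot product and the Pearson correlation can be checked with a short sketch. The helper names are hypothetical. Each profile is scaled to mean 0 and standard deviation 1 (population form), after which the dot product divided by the number of experiments equals the Pearson correlation coefficient.</p>

```python
from math import sqrt

def zscore(profile):
    """Scale a profile to mean 0 and standard deviation 1 (population form).

    Assumes the profile is not constant, otherwise the deviation is zero.
    """
    n = len(profile)
    mu = sum(profile) / n
    sd = sqrt(sum((x - mu) ** 2 for x in profile) / n)
    return [(x - mu) / sd for x in profile]

def linear_kernel(u, v):
    """Plain dot product, the kernel value between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def pearson(x, y):
    """Pearson correlation obtained from the dot product of z-scored profiles."""
    return linear_kernel(zscore(x), zscore(y)) / len(x)
```

<p>For the 1,011-experiment profiles used here, the linear kernel between two normalized genes is therefore 1,011 times their Pearson correlation.</p>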
</sec>
<sec id="Sec14">
<title>7 Sparse binary encoding of promoters (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 18)</title>
<p>Efforts to encode strings into kernel representations have progressed for many applications. The mismatch, gap, and
<italic>k</italic>
-mer kernels mentioned above have been used mainly for protein classification, translation initiation site detection, and mRNA splice site identification. Another straightforward sequence representation is the sparse bit encoding (Zien et al.
<xref ref-type="bibr" rid="CR93">2000</xref>
). In this simple scheme each nucleotide in a sequence is encoded by 4 bits, only one of which is set to 1. The nucleotide is identified as A, C, T, or G by the position of the “1” within its group of 4 bits. For an 800 bp promoter this yields an 800·4 = 3,200-dimensional vector for each example sequence, and the dot product of two such vectors is simply the number of positions at which the two sequences carry the same nucleotide.</p>
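<p>A minimal sketch of the sparse bit encoding follows. The bit order (A, C, G, T) and the function names are assumptions made for illustration; any fixed order gives the same dot products.</p>

```python
# Assumed bit order within each group of four bits.
NUCLEOTIDES = "ACGT"

def sparse_encode(seq):
    """Encode each nucleotide as four bits, exactly one of which is 1."""
    bits = []
    for base in seq:
        bits.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return bits

def dot(u, v):
    """Dot product of two equally long bit vectors."""
    return sum(a * b for a, b in zip(u, v))
```

<p>For example, the encodings of “ACGT” and “ACCT” have a dot product of 3, because the two sequences agree at three of their four positions.</p>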
</sec>
<sec id="Sec15">
<title>8 Promoter curvature and bend predictions (Table 
<xref rid="Tab2" ref-type="table">2</xref>
items 19 and 26)</title>
<p>It is well known that sequence-dependent DNA bending can be an important aspect of protein–DNA interactions. Some prominent examples of proteins that induce DNA bending are the TATA-binding protein (TBP) (Masters et al.
<xref ref-type="bibr" rid="CR58">2003</xref>
), catabolite activating protein (CAP), and the yeast Mcm1 transcription factor (Acton et al.
<xref ref-type="bibr" rid="CR1">1997</xref>
). A specific sequence of nucleotides that is more prone to bending into the proper configuration would provide a ready-made site for transcription factor binding. The particular bend and curve properties of known target genes may help discriminate them from non-targets.</p>
<p>Using the “Banana” algorithm in the EMBOSS toolkit, bend and curvature predictions were made along the promoters of all yeast genes. These were used as two separate genomic methods from which to generate classifiers for all 104 TFs, one based on bend predictions and one based on curve. Specifically, bending refers to the tendency of adjacent base pairs to be non-parallel (twists and short bends of ∼3 bp), whereas curvature refers to the tendency of the double-helix axis to follow a non-linear path for a distance of several base pairs (broad loops and arcs, ∼9 bp window). Banana follows the method of Goodsell and Dickerson (
<xref ref-type="bibr" rid="CR30">1994</xref>
) which is consistent with published experimental data (Satchwell et al.
<xref ref-type="bibr" rid="CR70">1986</xref>
). The output of the Banana algorithm becomes the feature values along a promoter for each example gene. For more details on the method see our Supplementary methods, reference (Goodsell and Dickerson
<xref ref-type="bibr" rid="CR30">1994</xref>
) or see the EMBOSS website (
<ext-link ext-link-type="uri" xlink:href="http://www.emboss.sourceforge.net/apps/banana.html">http://www.emboss.sourceforge.net/apps/banana.html</ext-link>
).</p>
</sec>
<sec id="Sec16">
<title>9 Homolog conservation (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 20)</title>
<p>This method is akin to the phylogenetic profiles taken from the COG database described above. Because COG uses a strict definition of orthology, namely bi-directional best hits within a group of at least three organisms, many genes are not allocated to any ortholog group. The method described here relaxes the definition of orthology to allow a profile to be constructed for any gene, while still discriminating between well-conserved sequences and weakly conserved sequences (Snitkin et al. personal communication). These phylogenetic profiles are constructed using BLASTP to compare yeast proteins to 180 prokaryotic genomes. The resulting best hit E-values are then discretized by placing them into one of six bins based on empirically determined E-value cut-offs. The bin numbers range from 0 (no significant hit) to 5 (very significant). Thus, a typical example gene will have 180 features, each corresponding to a different genome, with values ranging from 0 to 5 indicating the strength of the best BLASTP hit of that gene’s protein to another genome.</p>
</sec>
<sec id="Sec17">
<title>10 Hydroxyl cleavage—DNA accessibility (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 21)</title>
<p>It is possible that strands of DNA sharing little sequence similarity may still share common structural motifs. Transcription factors may seek out these structural cues for binding, thereby identifying conserved structural motifs when no strong consensus sequence can be detected. Experiments show that hydroxyl (OH) radical cleavage is an effective probe for DNA structure, in that strand breaking mirrors the accessible surface areas of the sugar-phosphate backbone (Balasubramanian et al.
<xref ref-type="bibr" rid="CR4">1998</xref>
; Parker et al.
<xref ref-type="bibr" rid="CR62">2005</xref>
; Tullius and Greenbaum
<xref ref-type="bibr" rid="CR82">2005</xref>
). A database of DNA sequences and their hydroxyl cleavage patterns has been published (Parker et al.
<xref ref-type="bibr" rid="CR62">2005</xref>
). This database allows accurate prediction of backbone accessibility for any sequence by sequentially examining every 3-mer in a sequence and looking up its experimental cleavage intensity as measured by phosphor imaging of cleaved, radio-labeled DNA separated by electrophoresis (Balasubramanian et al.
<xref ref-type="bibr" rid="CR4">1998</xref>
).</p>
<p>Predictions of this sort are generated for all sequences in the yeast genome and the individual 3-mer cleavage intensities along each promoter serve as feature vectors for TF–target classification. This method could prove useful in identifying potential targets when
<italic>k</italic>
-mer counts and other sequence based methods fail.</p>
</sec>
<sec id="Sec18">
<title>11 K-mer median positions from start (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 22)</title>
<p>A potential transcription factor binding site may be functional only when within a certain distance from other binding motifs or from the start site of transcription. When such positional constraints exist, they can be used to filter out sites which would otherwise become false positive predictions.</p>
<p>For each
<italic>k</italic>
-mer in a sequence, we record its median distance from the transcription start. This dataset will be useful in classifying targets for a transcription factor only if the factor shows positional bias in promoter binding.</p>
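<p>The positional feature can be sketched as follows, under the assumption that the promoter string is given 5' to 3' and ends immediately upstream of the transcription start, so that distance is counted from the end of the string. The function name and this distance convention are hypothetical.</p>

```python
from statistics import median

def kmer_median_distances(promoter, k):
    """Map each k-mer to the median distance of its occurrences from the start site.

    Distance is the number of bases between the 3' end of the k-mer and the
    end of the promoter string (assumed to abut the transcription start).
    """
    distances = {}
    for i in range(len(promoter) - k + 1):
        distance = len(promoter) - (i + k)
        distances.setdefault(promoter[i:i + k], []).append(distance)
    return {kmer: median(d) for kmer, d in distances.items()}
```

<p><italic>K</italic>-mers absent from a promoter receive no entry in this sketch; how such missing values are encoded for the SVM is not specified here.</p>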
</sec>
<sec id="Sec19">
<title>12 K-mer likelihoods (Table 
<xref rid="Tab2" ref-type="table">2</xref>
item 23)</title>
<p>Although
<italic>k</italic>
-mer counts may describe promoter composition, the abundance of non-informative sequences may hide the few
<italic>k</italic>
-mers which meaningfully contribute to class separation. Those
<italic>k</italic>
-mers which are statistically over-represented in a promoter can often be transcription factor binding sites, and this fact has been effectively used to identify biologically significant patterns (Cora et al.
<xref ref-type="bibr" rid="CR20">2004</xref>
; van Helden and Collado-Vides
<xref ref-type="bibr" rid="CR37">1998</xref>
; Haverty et al.
<xref ref-type="bibr" rid="CR34">2004</xref>
). For every possible
<italic>k</italic>
-mer of length 4, 5, or 6, we calculate the probability that the
<italic>k</italic>
-mer has
<italic>x</italic>
occurrences in a gene’s promoter. The negative logarithms of these probabilities are then the features used for SVM classification.</p>
<p>Background
<italic>k</italic>
-mer counts are obtained from RSA (van Helden
<xref ref-type="bibr" rid="CR35">2003</xref>
; van Helden and Collado-Vides
<xref ref-type="bibr" rid="CR37">1998</xref>
) tools. The prior probability (
<italic>f</italic>
) for a
<italic>k</italic>
-mer to be found in any position is calculated by dividing the total number of counts in the background sequence set by the total number of possible positions in the background set (here, the background set is the full set of 800 bp yeast promoters). Given this prior probability for a
<italic>k</italic>
-mer, the expected number of occurrences of the
<italic>k</italic>
-mer in any sequence can be calculated by
<disp-formula id="Equf">
<tex-math id="M12">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ m=f(L-k+ 1), $$\end{document}</tex-math>
</disp-formula>
where
<italic>L</italic>
is the length of the sequence and
<italic>k</italic>
is the length of the
<italic>k</italic>
-mer.</p>
<p>The goal is then to calculate the probability of finding the observed number of counts by chance given the expected number for a promoter. This can be done simply by using the probability mass function of the Poisson distribution with mean
<italic>m</italic>
. This method for calculating
<italic>k</italic>
-mer likelihoods is similar to the method described in (van Helden
<xref ref-type="bibr" rid="CR36">2004</xref>
). Thus, for each gene, a
<italic>p</italic>
-value will be calculated for each
<italic>k</italic>
-mer which represents the likelihood that the
<italic>k</italic>
-mer appears as many times as observed by chance. A feature vector for a gene is then the vector of probabilities describing all
<italic>k</italic>
-mers.</p>
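<p>The calculation can be sketched as follows, using the symbols <italic>f</italic>, <italic>L</italic>, <italic>k</italic> and <italic>m</italic> defined above. The function names are hypothetical, and the sketch uses the Poisson point probability of exactly <italic>x</italic> occurrences; a tail probability would be an alternative reading of the text.</p>

```python
from math import exp, factorial, log

def expected_count(f, L, k):
    """Expected occurrences m = f (L - k + 1) of a k-mer in a sequence of length L."""
    return f * (L - k + 1)

def poisson_pmf(x, m):
    """Probability of exactly x occurrences under a Poisson distribution with mean m."""
    return exp(-m) * m ** x / factorial(x)

def kmer_feature(x, f, L, k):
    """Negative log probability of observing x occurrences of the k-mer."""
    return -log(poisson_pmf(x, expected_count(f, L, k)))
```

<p>For a 6-mer with prior probability 0.01 in an 800 bp promoter, the expected count is 0.01 × 795 = 7.95, so a promoter with no occurrence receives a feature value of 7.95.</p>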
</sec>
<sec id="Sec20">
<title>13 Promoter melting temperature profile and promoter Delta G profile (Table 
<xref rid="Tab2" ref-type="table">2</xref>
items 24 and 25)</title>
<p>It is widely known that the initiation of transcription by polymerase involves melting of the DNA double helix. Several experiments have indicated that differences in melting temperature (
<italic>T</italic>
<sub>m</sub>
) of DNA can influence the rate of transcription by assisting or obstructing DNA melting by polymerase (Flickinger
<xref ref-type="bibr" rid="CR25">2005</xref>
), and there is evidence that torsional strain can play a role in duplex destabilization and opening (Benham
<xref ref-type="bibr" rid="CR7">1992</xref>
). Furthermore, it has been shown that sites thought to be susceptible to stress-induced duplex destabilization (SIDD) match well with gene regulatory regions (Benham
<xref ref-type="bibr" rid="CR8">1996</xref>
). It is therefore possible that transcription factors binding DNA may induce conformational adjustments in the promoter which slightly alter the stability of the helix. This change in stability may indirectly change the frequency or likelihood of transcription initiation. Indeed, recent models have shown correlation between sites of local promoter melting, regulatory sites, and initiation sites (Choi et al.
<xref ref-type="bibr" rid="CR15">2004</xref>
).</p>
<p>If certain transcription factors influence a target’s expression by altering promoter stability, their targets may contain a specific melting temperature or free-energy signature in their promoter regions. This signature could potentially distinguish targets from non-targets much as sequence motifs do. To include this information in a classifier, the EMBOSS (Rice et al.
<xref ref-type="bibr" rid="CR69">2000</xref>
) toolbox is used to calculate the melting and free energy profiles of all yeast promoters using a sliding window of 20 bp. Thus, for every 20 bp increment along each upstream region, a
<italic>T</italic>
<sub>m</sub>
value and a Gibbs free energy (Δ
<italic>G</italic>
at 25°C) is calculated. For these calculations EMBOSS uses the nearest-neighbor thermodynamics from (Breslauer et al.
<xref ref-type="bibr" rid="CR13">1986</xref>
; Baldino
<xref ref-type="bibr" rid="CR5">1989</xref>
). The
<italic>T</italic>
<sub>m</sub>
profile and the free energy profile become separate feature vectors for each gene, thereby providing two additional datasets which can be used for classification.</p>
</sec>
<sec id="Sec21">
<title>PSSM comparison</title>
<p>Using the same positive and negative sets as for the SVM procedure, PSSMs are also used to make predictions across the yeast genome at various score thresholds to serve as a comparison to predictions made by SVM. The threshold used for PSSM scanning was adjusted for each TF such that the overall specificity is held constant at 0.95 to match the SVM results. Other choices of threshold do not appear to improve performance. Loosening the threshold begins to dramatically increase false positive predictions beyond a prior of 0.2. By making detection stricter, false predictions are reduced along with sensitivity.</p>
</sec>
</sec>
</sec>
<sec id="Sec22">
<title>Results and discussion</title>
<p>After data pre-processing, the analysis begins with the independent evaluation of each dataset on each TF. Several kernel functions are tested and any necessary parameters are optimized before a final classifier is constructed (see “Methods”). A schematic of our procedure is given in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. Once parameter optimized classifiers are constructed for each TF–dataset pair, all of the datasets, represented by the optimized kernel matrices, are combined using a weighting scheme based on their
<italic>F</italic>
<sub>1</sub>
scores. The hypergeometric test is used to filter out datasets which do not perform better than random (accept
<italic>p</italic>
-value ≤ 0.05) for a particular TF. Accuracy estimates for the combined classifier are made using a final leave-one-out cross validation.</p>
<p>Three simple weighting schemes have been tried (see “Methods”), and the primary weight for a method is the ratio of its
<italic>F</italic>
<sub>1</sub>
score to that of the best performing method. The first scheme simply multiplies all kernel matrices by their scaled
<italic>F</italic>
<sub>1</sub>
scores and sums them. The second scheme squares the weights before multiplying. This has the effect of decreasing weights of poorly performing methods. Our third scheme uses the squared tangent of the primary weight. This will more severely penalize poor performers while boosting the weights of the best methods (e.g., instead of weight 1, the best method will have a weight of 2.43).</p>
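<p>The three weighting schemes can be sketched as follows. The labels STD, SQU and TAN follow the abbreviations used for the combined classifiers; the function names are hypothetical, kernel matrices are represented as plain nested lists, and the tangent is taken in radians, which reproduces the weight of 2.43 quoted for the best method.</p>

```python
from math import tan

def primary_weights(f1_scores):
    """Scale each F1 score by the F1 score of the best performing method."""
    best = max(f1_scores)
    return [f / best for f in f1_scores]

def scheme_weights(f1_scores, scheme):
    """Return kernel weights under the STD, SQU or TAN scheme."""
    w = primary_weights(f1_scores)
    if scheme == "STD":
        return w
    if scheme == "SQU":
        return [x ** 2 for x in w]
    if scheme == "TAN":
        return [tan(x) ** 2 for x in w]
    raise ValueError("unknown scheme: " + scheme)

def combine_kernels(kernels, weights):
    """Weighted sum of equally sized square kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]
```

<p>Under the TAN scheme a method with half the best <italic>F</italic><sub>1</sub> score receives a weight of about 0.30, compared with 0.5 under STD and 0.25 under SQU.</p>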
<p>We have been able to accurately classify the known targets of many transcription factors in
<italic>S. cerevisiae</italic>
. Figure 
<xref rid="Fig2" ref-type="fig">2</xref>
shows the performance of classifiers generated on each individual dataset (see also Supplementary Table 1). The combination of datasets performs better than any individual type of data, but the best single method achieves a sensitivity of 71% and a positive predictive value of 0.82. The combined datasets are labeled STD for weighting based on simply the scaled
<italic>F</italic>
<sub>1</sub>
measure, SQU for weighting based on squared, scaled
<italic>F</italic>
<sub>1</sub>
measure, and TAN for weighting based on the tangent squared
<italic>F</italic>
<sub>1</sub>
measure, as described in “Methods”. Other abbreviations can be found in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. Almost all methods perform much better than random. The exceptions are GO term annotation and phylogenetic profiles. For phylogenetic profiles this is not unexpected, since only 30% of the yeast genome has an established ortholog in the COG database. This absence of data means that many positive examples can no longer contribute to classification, leading to poor performance for most TFs. The situation is similar for GO term annotation, where many genes are poorly annotated or have no known function.</p>
<p>The performance statistics mentioned in Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
are a summary of those for all 104 combined classifiers. Since there are 9,104 known positives for all regulators, a sensitivity of 71% indicates that, considering all 104 classifiers, we recover 71% of the known data. This means that classifiers for some TFs have much higher sensitivities or PPVs while other classifiers perform no better than random.</p>
<p>The most powerful individual data set uses
<italic>k</italic>
-mer counts allowing one mismatch per
<italic>k</italic>
-mer. However, the combination of all of the methods shows increased sensitivity and precision over all individual methods. The squared-tangent weighting function performs the best overall, reaching a sensitivity of 73% and a positive predictive value of 0.89. Looking only at the top 20 TFs, we see a sensitivity and PPV of 88.2% and 0.9, respectively. Our results show that combining datasets increases sensitivity only incrementally over classifiers built on simple
<italic>k</italic>
-mer counts alone, and that it produces a small improvement in positive predictive value. Thus, combining methods results in the modest reduction of false positive classifications.</p>
<p>The use of the hypergeometric distribution to test the significance of a dataset for each TF allows us to assess how useful a particular data type is for target identification. Figure 
<xref rid="Fig3" ref-type="fig">3</xref>
plots the percentage of TFs for which each dataset has been found to be significant at
<italic>p</italic>
 ≤ 0.05. Overall, sequence based methods (
<italic>k</italic>
-mer counts, mismatch and gapped
<italic>k</italic>
-mer counts, and
<italic>k</italic>
-mer likelihoods) show the best overall coverage, being significant for almost all transcription factors. Structural descriptions of the promoter region differ greatly in their usefulness, varying from DNA curve prediction, useful for ∼15% of TFs, to melting temperature profiles and free energy values, significant for over 60% of TFs tested.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Percentage of TFs for which each dataset is significant (
<italic>p</italic>
 ≤ 0.05). Percentage of TFs is on the left axis and datasets are numbered along the bottom with a key given to the right of the diagram (see Table 
<xref rid="Tab2" ref-type="table">2</xref>
for descriptions of method abbreviations)</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig3_HTML" id="MO11"></graphic>
</fig>
</p>
<p>In work with genomic datasets having large numbers of features (e.g.,
<italic>k</italic>
-mer counts, expression measurements) there is always an inherent risk of over-fitting when the number of positives and negatives are relatively small. To give a more practical portrayal of our method and prevent an overly optimistic view of the results, it is illuminating to compare our results with those from classifiers obtained by training on random data. Thus three random datasets have been constructed as controls and their results displayed in Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
. The first, abbreviated R, is simply randomly shuffled
<italic>k</italic>
-mer count data. The second (RH) is created by shuffling a composite dataset composed of a random 10% selection of each individual dataset. The third (RN) is a normally distributed random set of numbers with mean of 0 and standard deviation of 1.</p>
<p>Although performance is much better than random, it is doubtful from these results that predictions obtained by applying our classifiers to the entire genome would yield truly reliable targets without further processing. A simple classification of all potential targets with our 104 classifiers returns, on average, ∼800 new targets for each TF. The conditional probabilities given as output from Platt’s method (Platt
<xref ref-type="bibr" rid="CR66">1999</xref>
) allow the selection of possible targets at a desired probability threshold. For instance, one can select predictions for which the probability of being a positive is greater than 0.99. In some of the examples below, the top targets were selected in this fashion and compared to the full set of known positive genes.</p>
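<p>Selection at a probability threshold can be sketched as follows. Platt’s method maps an SVM decision value to a probability through a fitted sigmoid with two parameters, written here as A and B. The parameter values, gene labels and function names in this sketch are hypothetical, and the fitting of A and B is not shown.</p>

```python
from math import exp

def platt_probability(decision_value, A, B):
    """Sigmoid mapping from an SVM decision value to P(positive | decision value)."""
    return 1.0 / (1.0 + exp(A * decision_value + B))

def select_targets(decision_values, A, B, threshold=0.99):
    """Keep the genes whose estimated probability reaches the threshold."""
    return [gene for gene, value in decision_values.items()
            if platt_probability(value, A, B) >= threshold]
```

<p>With a negative value of A, genes lying farther on the positive side of the separator receive probabilities closer to 1, so raising the threshold retains only the most confident predictions.</p>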
<p>Another method to reduce the risk of over-fitting, which we reserve for our future work, is application of sophisticated dimension reduction techniques to discover significant features in different datasets based on classifier performance. Feature selection and clustering will allow the most relevant features from different datasets to be retained while large portions of redundant and irrelevant information are discarded. In some cases this has been shown to increase classifier accuracy. In other cases, the reduction in the complexity of the problem is worthwhile since other learning algorithms, like
<italic>k</italic>
-nearest-neighbors or Bayes networks, which are difficult to train on large feature sets, could be compared efficiently on the smaller set of features. Although it is clear that combination of data slightly increases performance, it is natural to ask whether such complexity of data is worthwhile when
<italic>k</italic>
-mer based data alone contributes a large portion of the classification accuracy. Dimension reduction techniques can help address this by potentially eliminating thousands of features. This will make it simpler to classify new sequences for which not all datasets are available since only the most relevant features need be present. In practice, it is likely that only a few data types will be needed to make useful predictions for most applications.
<italic>K</italic>
-mer counts,
<italic>k</italic>
-mer overrepresentation, and an improved measure of sequence conservation might comprise a baseline dataset for further refinement.</p>
<p>The dynamics of the individual classifiers can also be examined based on distributions of sensitivity and
<italic>F</italic>
<sub>1</sub>
score as compared to the random classifier. Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
a, c show the distribution of
<italic>F</italic>
<sub>1</sub>
score and sensitivity, respectively, for normal random data. Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
b, d show the same distributions but for actual data (26 method combination with tangent weights). The sensitivities and
<italic>F</italic>
<sub>1</sub>
scores for actual data have distributions heavily shifted to the right as opposed to those for random data. Although the majority of classifiers are comparatively good, several TFs have poor performance, something which warrants further inspection. There are four classifiers for which the
<italic>F</italic>
<sub>1</sub>
score and sensitivity are zero (YHL020C, YNL139C, YER068W, and YER161C). These factors have few known targets relative to the others: on average these four TFs have 10 targets each (one of them has only three positives) in their training sets, compared to an average of 88 targets for most regulators. This low number of positive examples is likely the cause of the poor performance. Figure 
<xref rid="Fig5" ref-type="fig">5</xref>
shows a plot of sensitivity vs. TF sorted by increasing number of positives for all classifiers. The general trend shows that classifiers having more positives give better performance.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Random vs. combined classifiers. (
<bold>a</bold>
) Distribution of
<italic>F</italic>
<sub>1</sub>
scores for normal random classifiers, (
<bold>b</bold>
) the same distribution on classifiers made from 26 dataset combinations for all TFs. (
<bold>c</bold>
) Sensitivity distribution for normal random classifiers and (
<bold>d</bold>
) the sensitivity distribution for the 26 dataset classifiers for all TFs</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig4_HTML" id="MO12"></graphic>
</fig>
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Sensitivity as a function of increasing positives. Classifiers for each TF were sorted according to increasing number of positives and the trend in their sensitivity is shown. Generally, classifiers with more positive examples perform better</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig5_HTML" id="MO13"></graphic>
</fig>
</p>
<sec id="Sec23">
<title>Biological insights—promoter melting</title>
<p>Beyond categorizing genomic datasets as useful or not for classification purposes, the significance of a particular dataset has potential biological implications for a TF. To see if this could be explored based on our results, the factor YJR060W was chosen for further examination, since the promoter melting temperature profile is significant for this TF at
<italic>p</italic>
 = 0.0037. Figure 
<xref rid="Fig6" ref-type="fig">6</xref>
shows a plot of the average promoter melting temperature curve (calculated using a 20 bp window and moving in steps of 1 bp) over all genes in yeast (solid blue), the average curve for genes in this TF’s negative set (dashed blue), the average in the TF’s positive set (dashed red), and the average in the most significant 33 predicted targets of the TF (solid red). The top 33 targets have Platt conditional probabilities
<italic>P</italic>
(positive | distance from separator) ≥ 0.99 and are obtained from the predictions made using the combination of all datasets, thus representing the best predictions we can make for this TF. This is loosely analogous to choosing predictions significant with a
<italic>p</italic>
-value of 0.01. These most significant targets contain 18 new predictions which are not part of the original positive set.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Melting temperature curves for YJR060W. Using a 20 bp window for DNA melting temperature calculation, the temperature plots are presented for the average over all 5571 yeast genes (solid blue), positive targets for YJR060W (dashed red), negatives for YJR060W (dashed blue), and high confidence targets (solid red—
<italic>P</italic>
(true|distance to separator) ≥ 0.99) determined using Platt’s method for probability assignment to SVM output. Under the graph is an indicator displaying hits to the YJR060W consensus sequence in the top 33 targets. Consensus hits are distributed throughout the 800 bp upstream space</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig6_HTML" id="MO14"></graphic>
</fig>
</p>
<p>Clearly, the positive and negative groups for this TF contain average differences in promoter melting temperature. This difference is magnified when only the best targets are examined. The best 33 predictions have a very different melting signature from the negative set and the average yeast gene. A two-sample
<italic>t</italic>
-test was used to find the significance of this difference from the average curve. The purple overbar in Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
shows the window positions where the best targets have an average value which is significant at
<italic>p</italic>
 ≤ 0.01. Almost all positions show a significant increase in melting temperature, with the exception of several positions proximal to the transcription start site. Considering that the transcription machinery must unwind the helix in this region, it is not unexpected that the melting temperature here would be lower, as this would reduce the activation energy needed to dissociate the strands.</p>
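<p>The sliding-window comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the calculation used for Fig. 6: the GC fraction of each window stands in for the melting temperature model described in “Methods”, and the two-sample statistic is computed in Welch's unequal-variance form. The function names are hypothetical. Converting each statistic to a p-value would additionally require a t-distribution from a statistics library.</p>

```python
from math import sqrt
from statistics import mean, variance


def window_profile(seq, window=20, step=1):
    """GC fraction of each window; a crude proxy for melting temperature."""
    seq = seq.upper()
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(0, len(seq) - window + 1, step)
    ]


def welch_t(sample_a, sample_b):
    """Two-sample t statistic allowing unequal variances (Welch)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


def per_position_t(group_a, group_b, window=20):
    """t statistic at every window position for two sets of equal-length promoters."""
    profiles_a = [window_profile(s, window) for s in group_a]
    profiles_b = [window_profile(s, window) for s in group_b]
    # zip(*profiles) regroups the values by window position.
    return [
        welch_t(col_a, col_b)
        for col_a, col_b in zip(zip(*profiles_a), zip(*profiles_b))
    ]
```

<p>Positions where both groups have zero variance would need special handling before this is applied to real promoters.</p>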
<p>As reviewed in “Methods”, there is ample support for the idea that melting temperature can influence transcription (Flickinger
<xref ref-type="bibr" rid="CR25">2005</xref>
), and that torsional strain can affect the stability of the DNA duplex (Benham
<xref ref-type="bibr" rid="CR7">1992</xref>
). Experiments have also shown that sites susceptible to this kind of destabilization correlate well with regulatory regions (Benham
<xref ref-type="bibr" rid="CR8">1996</xref>
). In light of the high melting temperature of promoter targets of YJR060W, it is possible that duplex destabilization plays a role in regulation by this TF. Indeed, experiments have shown that YJR060W functions largely in recruiting chromatin remodelling factors to proximal promoters (Kent et al.
<xref ref-type="bibr" rid="CR50">2004</xref>
). The exact mechanism for this recruitment is not fully understood, but it is required for transcription at some promoters and complementary to additional binding factors at others (Kent et al.
<xref ref-type="bibr" rid="CR50">2004</xref>
). In any case, one hypothesis is that duplex stability is an important mechanism for regulation at these promoters and that YJR060W binding affects this stability through a conformational change induced either by its own binding or by the recruitment of chromatin remodelling factors. Such conformational changes may alter the torsional strain on the DNA and thus affect the melting temperature prior to transcription.</p>
</sec>
<sec id="Sec24">
<title>Biological insights—binding site detection</title>
<p>Our results demonstrate that there is clearly a signal distinguishing ChIP–chip positives from other genes. Other groups have had less success confirming the validity of the ChIP–chip data, and this has led some to consider that as many as 50% (Simonis et al.
<xref ref-type="bibr" rid="CR74">2004</xref>
) to 60% (Gao et al.
<xref ref-type="bibr" rid="CR28">2004</xref>
) of the targets produced by ChIP–chip are false positives in the assay. The fact that the high throughput results are chosen to be significant with
<italic>p</italic>
 ≤ 0.001 indicates that the transcription factors do in fact bind their targets. It is certainly possible that this binding does not always translate into changes in gene expression, that the changes are not large enough to be considered significant, or perhaps that the conditions under which binding would result in expression change were not tested. In any case, our classifier appears to pick up the information necessary to identify target genes.</p>
<p>To find this signal we have looked at the results of various individual datasets and extracted the attributes which contribute most to a transcription factor’s classifier. Support vector machines are often considered a “black box” method, since their results are not as readily interpretable as, for instance, the probability assessment of Bayesian classifiers. Nevertheless, the
<bold>w</bold>
vector described above can give an indication of which features in the data are important to the classification. Features whose components
<italic>w</italic>
<sub>
<italic>i</italic>
</sub>
are large correspond to dimensions in feature space where positives and negatives are more widely separated. Thus by examining a single dataset, e.g.
<italic>k</italic>
-mer counts, it is possible to determine the
<italic>k</italic>
-mer(s) most responsible for differences between positives and negatives. To this end,
<bold>w</bold>
-vectors from the
<italic>k</italic>
-mer count dataset have been calculated for each linear TF classifier and examined to determine which
<italic>k</italic>
-mers had the largest weights. We compare these
<italic>k</italic>
-mers to known binding sites for each factor. Results for the best 10 TFs can be seen in Table 
<xref rid="Tab3" ref-type="table">3</xref>
, where the highest ranked
<italic>k</italic>
-mers are manually assembled to show their correspondence with known binding motifs. In most cases the
<italic>k</italic>
-mers with the highest weights closely match the reported binding site for the TF, showing that the algorithm is choosing meaningful features for classification. For example, the DNA binding protein Cep1 is known to bind the consensus TCACGTG and regulate cell cycle and stress response genes. The highest weighted
<italic>k</italic>
-mer in the classifier is CACGT, and the top 4
<italic>k</italic>
-mers all overlap precisely with the known site (CACGT, CGTG, TCACG, TCACGT).</p>
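<p>The feature ranking described above can be sketched as follows, assuming the standard linear SVM expansion of the weight vector as a sum over support vectors of alpha_i · y_i · x_i. The support vectors, multipliers, and k-mer counts are hypothetical toy values, and ranking by absolute weight is an assumption; sorting by the signed weight instead would keep only features enriched in positives.</p>

```python
def weight_vector(support_vectors, labels, alphas):
    """w = sum_i alpha_i * y_i * x_i for a linear SVM."""
    n_features = len(support_vectors[0])
    w = [0.0] * n_features
    for x, y, alpha in zip(support_vectors, labels, alphas):
        for j in range(n_features):
            w[j] += alpha * y * x[j]
    return w


def top_features(w, names, n=3):
    """Feature names and weights ordered by decreasing absolute weight."""
    ranked = sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:n]


# Toy k-mer count vectors for one positive and one negative promoter.
kmers = ["CACGT", "CGTG", "AAAAA", "TTTTT"]
support_vectors = [[3, 2, 0, 1], [0, 0, 1, 1]]
labels = [1, -1]
alphas = [0.5, 0.5]

w = weight_vector(support_vectors, labels, alphas)
best = top_features(w, kmers, n=2)
```

<p>Here the two k-mers that occur only in the positive example receive the largest weights, which mirrors how the highest ranked k-mers are compared with known binding sites.</p>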
</sec>
<sec id="Sec25">
<title>Biological insights—microarray expression</title>
<p>The ability to identify the primary conditions under which a transcription factor exerts control would be a critical component of any focused study of gene regulation. As we have seen, the
<bold>w</bold>
vector generated on a dataset indicates which of its components are most important for discriminating targets. In the case of gene expression classifiers,
<bold>w</bold>
elucidates which expression conditions are discriminatory. Intuitively, these are the conditions in which we would expect to see differential regulation of true targets. Given the predictions made using the combination of all methods, and the
<bold>w</bold>
obtained from the linear classifier built on expression data alone, we can see whether the predicted targets have differential regulation, and identify conditions where the TF is likely to act.</p>
<p>By the hypergeometric test, expression data is a significant predictor (
<italic>p</italic>
 = 6.12 × 10<sup>−14</sup>) of targets for Fhl1, a forkhead-like TF known to be involved in rRNA processing and ribosomal protein gene expression. The
<bold>w</bold>
for this TF’s classifier from expression data has been calculated and sorted to determine the conditions having the highest weight. Figure 
<xref rid="Fig7" ref-type="fig">7</xref>
shows a plot of expression values over the top 25 conditions for the average yeast gene (solid blue), the average for genes in Fhl1’s negative set (dashed blue), the average in the positive set (dashed red), and the average in the most significant (
<italic>P</italic>
(true) ≥ 0.99) 48 targets of this TF (solid red).
<fig id="Fig7">
<label>Fig. 7</label>
<caption>
<p>Expression plot of Fhl1 targets over top 25 discriminative conditions. Average expression is plotted over all 5571 yeast genes (solid blue), over the negative set for Fhl1 (dashed blue), the positive targets (dashed red), and the most significant targets (solid red),
<italic>P</italic>
(true | distance from classifier) ≥ 0.99. The best targets have expression significantly different from that of the average or negative genes. The chosen expression conditions, ranked by
<bold>w</bold>
-vector from the expression-based classifier, are shown under the graph with numbers indicating the position of the conditions in the graph. These conditions make sense since Fhl1 is regulated by the TOR signalling pathway, which is blocked by rapamycin. There is also some support in the literature for TOR having a role in meiosis and stress response</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig7_HTML" id="MO15"></graphic>
</fig>
</p>
<p>For 23 of the top 25 conditions the highly significant targets show expression which is different from both the average and the negative sets (
<italic>t</italic>
-test
<italic>p</italic>
-value ≤ 0.01). Most importantly, the 10 top-ranked conditions include six in which yeast cells were treated with rapamycin and two involving meiosis/sporulation. This result is satisfying since rapamycin treatment specifically inhibits the Target of Rapamycin (TOR) signalling pathway, which is known to activate ribosomal protein expression as well as regulate several other pathways in yeast. Inhibition of TOR directly prevents Fhl1 from binding at promoter sites, thereby down-regulating expression of ribosomal protein genes (Martin et al.
<xref ref-type="bibr" rid="CR57">2004</xref>
), explaining why Fhl1 targets show differential expression in these experiments. Furthermore, although Fhl1 has not been directly implicated in meiosis, TOR pathway kinases are required for meiosis (Zheng and Schreiber
<xref ref-type="bibr" rid="CR91">1997</xref>
), indirectly suggesting that Fhl1 might be involved. This is a reasonable suggestion since Fhl1 has been shown to alter its activity in response to factors (mainly Sfp1, which is also under TOR control) controlling progression to Start in the yeast cell cycle. Thus the most highly ranked experiments seem to correlate well with the real biological roles of the TF, indicating that the SVM can correctly rank important experimental conditions. Our method can identify differential regulation as an important predictor of target genes (hypergeometric test) and use the SVM-based classifier to make testable hypotheses about which conditions show biological effects of transcription factor activity.</p>
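<p>A hypergeometric test of the kind referred to above can be sketched as an upper-tail probability: the chance of seeing at least k known targets among the predicted genes if predictions were drawn at random. The exact contingency setup used for Fhl1 is described in “Methods” and is not reproduced here; the function name and the counts in the example are hypothetical.</p>

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population containing `successes` items of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 4 known targets, 3 predictions, 2 of them known targets.
p_value = hypergeom_upper_tail(2, population=10, successes=4, draws=3)
```

<p>For the toy counts the tail probability is 40/120, or one third; genome-scale counts are handled the same way because the binomial coefficients are computed with exact integers.</p>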
</sec>
<sec id="Sec26">
<title>Biological insights—PSSM comparison</title>
<p>We have found that support vector classification performs better than a simple weight matrix scan, and the combination of 26 methods outperforms any one method by itself. In some sense, a direct comparison with these PSSMs is not entirely fair since a majority of the weight matrices used here were created by motif discovery procedures rather than directed experimentation (such as DNA footprinting). Also, carefully constructed variants of PSSMs, which may take into account motif conservation in multiple species or interdependence of bases, can offer state of the art motif detection. Unfortunately, sufficient data is not always available to build such detailed models. The purpose of our comparison is simply to highlight the improved performance of classification methods relative to the commonly available binding site models. Figure 
<xref rid="Fig8" ref-type="fig">8</xref>
shows the result of a comparison between simple PSSM scanning using the MotifScanner algorithm and predictions by SVM on combined data. The leftmost grouping is a result from scans using PSSMs for all 104 TFs against the positive and negative sets on which the SVMs were trained. The MotifScanner score threshold was chosen individually for each TF so that the specificity on the training set was held constant at 0.95. This makes comparison to the SVM classifiers more straightforward as overall specificity for the SVMs is 0.95. The grouping on the right restates the performance of the SVMs with 26 combined datasets on the full set of positives. The SVM classifiers outperform PSSMs in the number of detected positives. It is clear that loosening the thresholds for the PSSMs would allow for better coverage but degrade performance by increasing the number of false positive predictions. Support vector machine classifiers offer a good balance between sensitivity and false prediction.
<fig id="Fig8">
<label>Fig. 8</label>
<caption>
<p>SVM vs. PSSM scan. Left: PSSMs for 104 TFs scanned against positive and negative sets. Overall specificity is held constant at 0.95 to match that of the SVM results. Right: Overall results for SVM classifiers trained on a weighted combination of 26 datasets</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig8_HTML" id="MO16"></graphic>
</fig>
</p>
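<p>Choosing a per-TF score threshold that fixes specificity on the training negatives, and then reading off sensitivity on the positives, can be sketched as follows. This is a minimal illustration with hypothetical scores; it does not call MotifScanner, and the function names and the tie-handling rule are assumptions.</p>

```python
import math


def cutoff_for_specificity(negative_scores, specificity=0.95):
    """Cutoff such that calling `score > cutoff` positive misclassifies at most
    a fraction (1 - specificity) of the negatives."""
    n = len(negative_scores)
    # Small tolerance guards against floating-point error in (1 - specificity) * n.
    allowed_false_positives = math.floor((1.0 - specificity) * n + 1e-9)
    ordered = sorted(negative_scores, reverse=True)
    return ordered[min(allowed_false_positives, n - 1)]


def sensitivity(positive_scores, cutoff):
    """Fraction of known positives scoring above the cutoff."""
    hits = sum(1 for score in positive_scores if score > cutoff)
    return hits / len(positive_scores)


# Hypothetical motif scores for one TF.
negatives = list(range(1, 21))      # 20 negative promoters
positives = [25, 22, 19, 10, 30]    # 5 positive promoters

cutoff = cutoff_for_specificity(negatives, 0.95)
detected = sensitivity(positives, cutoff)
```

<p>With these toy scores one of twenty negatives exceeds the cutoff, giving a specificity of 0.95, and three of five positives are detected. Lowering the cutoff would recover more positives at the cost of more false positives, which is the trade-off noted above.</p>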
</sec>
<sec id="Sec27">
<title>Biological insights—pathway control</title>
<p>Finally, we have applied the combined classifier for each TF to all promoters in the yeast genome in order to expand the known binding repertoire of each factor. On average, each classifier produced approximately 884 new targets. Although it is unlikely that this set is free of false positives, examining the data in the context of biochemical pathways can highlight significant predictions and quickly identify new sites that are good candidates for further study.</p>
<p>Gcn4 is a transcription factor in yeast known to control genes in the amino acid biosynthetic pathway (Hinnebusch
<xref ref-type="bibr" rid="CR38">1992</xref>
), and SVM predictions match well with the known biology of Gcn4 control mechanisms. The final classifier for this TF has an F1 score of 0.89, sensitivity of 0.86, and PPV of 0.92. This TF is a master regulator which has known targets in at least 12 amino acid biosynthetic pathways and has been shown by gene expression to induce at least 1/10th of the yeast genome (Hinnebusch and Natarajan
<xref ref-type="bibr" rid="CR39">2002</xref>
). Figure 
<xref rid="Fig9" ref-type="fig">9</xref>
highlights some known targets of Gcn4 in methionine/threonine biosynthesis in the aspartate family pathway. Branch-points from this pathway can ultimately lead to the amino acids methionine, threonine, lysine, and isoleucine. These amino acids are of particular interest because they are essential in humans, that is, they are not synthesized by human metabolism. Gcn4 is known to regulate the genes Hom3, Thr1, and Thr4, leading to threonine, lysine, and isoleucine. However, predictions by SVM indicate that it also directly targets committed steps of methionine biosynthesis by binding the promoters of Met2, Met17, and Met6, which are interesting targets for further study.
<fig id="Fig9">
<label>Fig. 9</label>
<caption>
<p>GCN4 and amino acid biosynthesis. Predictions by SVM match well with the known biology of Gcn4 control mechanisms. Pathway map generated with the Pathway Tools Omics Viewer at SGD (Christie et al.
<xref ref-type="bibr" rid="CR16">2004</xref>
)</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig9_HTML" id="MO17"></graphic>
</fig>
</p>
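<p>The F1 score reported for the Gcn4 classifier can be checked from the reported sensitivity and PPV, assuming the usual definition of F1 as the harmonic mean of the two. The function name is illustrative.</p>

```python
def f1_score(sensitivity, ppv):
    """Harmonic mean of sensitivity (recall) and positive predictive value."""
    return 2.0 * sensitivity * ppv / (sensitivity + ppv)


# Values reported in the text for the Gcn4 classifier.
gcn4_f1 = f1_score(0.86, 0.92)  # about 0.889, which rounds to the reported 0.89
```

<p>Because the harmonic mean is pulled toward the smaller of its two inputs, a high F1 requires both sensitivity and PPV to be high.</p>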
<p>Previously Gcn4 was known to indirectly influence synthesis of methionine by activating Met4, a transcription factor specific to methionine biosynthesis and sulphur metabolism (Mountain et al.
<xref ref-type="bibr" rid="CR61">1993</xref>
). It is plausible that regulation of these enzymes by both Gcn4 and its target Met4 represents a transcriptional feed-forward loop. Such loops have been described before and can be advantageous to an organism by exhibiting sign-sensitive delay, since it may be useful to have a quick response when shifting to an OFF state and a slow response when turning back ON (Mangan et al.
<xref ref-type="bibr" rid="CR56">2003</xref>
).</p>
<p>The Rap1 DNA binding factor is a widely known regulator in the cell cycle, acting as a repressor or activator depending on its context. Rap1 is also a key element in the structure of yeast telomeres, where it plays a role in telomere silencing (Pina et al.
<xref ref-type="bibr" rid="CR65">2003</xref>
). In a seemingly contradictory role, Rap1 has also been shown to regulate several glycolytic enzymes, as shown in Fig. 
<xref rid="Fig10" ref-type="fig">10</xref>
. The specificity of this glycolytic regulation is dependent on a second factor, Gcr2, which binds to the Rap1/Gcr1 complex but does not contact DNA directly (Deminoff and Santangelo
<xref ref-type="bibr" rid="CR22">2001</xref>
). New predictions by SVM in the pathways of sugar metabolism show good correspondence with expectations for Rap1 (Fig. 
<xref rid="Fig10" ref-type="fig">10</xref>
). Most interestingly, the new predictions include both isoforms of the enzyme phosphofructokinase. This step, where fructose-6-phosphate is converted into fructose-1,6-bisphosphate, is the crucial step in sugar breakdown where most metabolic flux through the pathway is controlled (Zubay
<xref ref-type="bibr" rid="CR94">1996</xref>
).
<fig id="Fig10">
<label>Fig. 10</label>
<caption>
<p>Rap1 and glycolytic/TCA cycle reactions. Glycolytic reactions leading to acetate and ethanol are shown. The gray box on the left contains a pathway overview of glycolysis, fermentation and the TCA cycle, where red connections are known and yellow are predicted. Rap1 can be seen to regulate key control points in glycolysis and the TCA cycle. Pathway map generated with the Pathway Tools Omics Viewer at SGD (Christie et al.
<xref ref-type="bibr" rid="CR16">2004</xref>
)</p>
</caption>
<graphic position="anchor" xlink:href="11693_2006_9003_Fig10_HTML" id="MO18"></graphic>
</fig>
</p>
<p>Also of significance is the prediction that Rap1 regulates malate dehydrogenase in the TCA cycle. Malate dehydrogenase is unique in the TCA cycle in that it has a very small equilibrium constant, meaning that the forward reaction from malate to oxaloacetate is highly unfavorable. This is generally overcome during aerobic growth since the subsequent reaction is extremely favorable (large free energy release). However, in the absence of oxygen the cell still requires certain intermediates which can no longer be made in the normal way. Running the malate dehydrogenase reaction in reverse, a favorable direction, can provide a way to synthesize these intermediates (Zubay
<xref ref-type="bibr" rid="CR94">1996</xref>
). Rap1 is already known to regulate the conversion of acetaldehyde to ethanol via alcohol dehydrogenase, and the possible complementary control of malate dehydrogenase suggests a possible role for Rap1 in regulation of fermentative growth.</p>
</sec>
</sec>
<sec id="Sec28">
<title>Conclusions</title>
<p>We have seen that support vector machines can accurately classify transcription factor binding sites using a wide range of genomic data types. Combining various information sources can reduce false positives and incrementally increase sensitivity, while post-processing of the data to assign posterior probabilities allows the selection of high confidence targets. Although the maximal margin of SVMs is resistant to over-fitting, over-fitting can be reduced further by selecting the best features for classifier construction. Feature selection and clustering techniques can be used in future work to refine predictions and more efficiently compare the SVM to other learning machines (KNN, Bayes, and neural networks), which do not easily handle high-dimensional or correlated data.</p>
<p>Based on
<italic>k</italic>
-mer data, SVMs appear to isolate appropriate features for classification, since many known transcription factor binding sites overlap with the highest-ranked
<italic>k</italic>
-mers. Examination of melting temperature classifiers for YJR060W demonstrates the unique biological features of targets for that TF. Similarly, expression-based classifiers for Fhl1 show the conditions under which Fhl1 acts on its targets, pointing the way to testable hypotheses supported by data in the literature. Finally, targets of Gcn4 and Rap1, when put into the context of biological pathways, correspond well to published experiments and show the effectiveness of integrated classifiers for building system-wide gene regulatory networks. Future work will involve the development of methods to discover biologically significant features in different datasets based on classifier performance, and of intelligent dimension-reduction techniques to reduce noise and improve accuracy.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Electronic supplementary material</title>
<supplementary-material id="N0x1a5a960N0x2ccc680" content-type="local-data">
<media xlink:href="11693_2006_9003_MOESM1_ESM.doc" id="MOESM1" mimetype="application" mime-subtype="msword">
<caption>
<p>Supplementary material</p>
</caption>
</media>
<media xlink:href="11693_2006_9003_MOESM2_ESM.doc" id="MOESM2" mimetype="application" mime-subtype="msword">
<caption>
<p>Supplementary material</p>
</caption>
</media>
<media xlink:href="11693_2006_9003_MOESM3_ESM.doc" id="MOESM3" mimetype="application" mime-subtype="msword">
<caption>
<p>Supplementary material</p>
</caption>
</media>
<media xlink:href="11693_2006_9003_MOESM4_ESM.xls" id="MOESM4" mimetype="application" mime-subtype="vnd.ms-excel">
<caption>
<p>Supplementary material</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We acknowledge Steve Parker and Tom Tullius for providing the DNA hydroxyl cleavage predictions from their database, and Adam Gustafson and Evan Snitkin for the Homolog Conservation (method 8) profiles for yeast genes.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Acton</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Vershon</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>DNA-binding specificity of Mcm1: operator mutations that alter DNA-bending and transcriptional activities by a MADS box protein</article-title>
<source>Mol Cell Biol</source>
<year>1997</year>
<volume>17</volume>
<fpage>1881</fpage>
<lpage>1889</lpage>
</citation>
<citation citation-type="display-unstructured">Acton T, Zhong H, Vershon A (1997) DNA-binding specificity of Mcm1: operator mutations that alter DNA-bending and transcriptional activities by a MADS box protein. Mol Cell Biol 17:1881–1889
<pub-id pub-id-type="pmid">9121436</pub-id>
</citation>
</ref>
<ref id="CR2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aerts</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Thijs</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Coessens</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Staes</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Toucan: deciphering the cis-regulatory logic of coregulated genes</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>1753</fpage>
<lpage>1764</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg268</pub-id>
</citation>
<citation citation-type="display-unstructured">Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B (2003) Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 31:1753–1764
<pub-id pub-id-type="pmid">12626717</pub-id>
</citation>
</ref>
<ref id="CR3">
<citation citation-type="other">Allocco D, Kohane I, Butte A (2004) Quantifying the relationship between co-expression, co-regulation, and gene function. BMC Bioinformatics 5:18</citation>
</ref>
<ref id="CR4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Balasubramanian</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Pogozelski</surname>
<given-names>WK</given-names>
</name>
<name>
<surname>Tullius</surname>
<given-names>TD</given-names>
</name>
</person-group>
<article-title>DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone</article-title>
<source>PNAS</source>
<year>1998</year>
<volume>95</volume>
<fpage>9738</fpage>
<lpage>9743</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.95.17.9738</pub-id>
</citation>
<citation citation-type="display-unstructured">Balasubramanian B, Pogozelski WK, Tullius TD (1998) DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone. PNAS 95:9738–9743
<pub-id pub-id-type="pmid">9707545</pub-id>
</citation>
</ref>
<ref id="CR5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baldino</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>High-resolution in situ hybridization histochemistry</article-title>
<source>Meth Enzymol</source>
<year>1989</year>
<volume>168</volume>
<fpage>761</fpage>
<lpage>777</lpage>
</citation>
<citation citation-type="display-unstructured">Baldino F (1989) High-resolution in situ hybridization histochemistry. Meth Enzymol 168:761–777
<pub-id pub-id-type="pmid">2725322</pub-id>
</citation>
</ref>
<ref id="CR6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Tavazoie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Predicting gene expression from sequence</article-title>
<source>Cell</source>
<year>2004</year>
<volume>117</volume>
<fpage>185</fpage>
<lpage>198</lpage>
<pub-id pub-id-type="doi">10.1016/S0092-8674(04)00304-6</pub-id>
</citation>
<citation citation-type="display-unstructured">Beer MA, Tavazoie S (2004) Predicting gene expression from sequence. Cell 117:185–198
<pub-id pub-id-type="pmid">15084257</pub-id>
</citation>
</ref>
<ref id="CR7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benham</surname>
<given-names>CJ</given-names>
</name>
</person-group>
<article-title>Energetics of the strand separation transition in superhelical DNA</article-title>
<source>J Mol Biol</source>
<year>1992</year>
<volume>225</volume>
<fpage>835</fpage>
<lpage>847</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(92)90404-8</pub-id>
</citation>
<citation citation-type="display-unstructured">Benham CJ (1992) Energetics of the strand separation transition in superhelical DNA. J Mol Biol 225:835–847
<pub-id pub-id-type="pmid">1602485</pub-id>
</citation>
</ref>
<ref id="CR8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benham</surname>
<given-names>CJ</given-names>
</name>
</person-group>
<article-title>Duplex destabilization in superhelical DNA is predicted to occur at specific transcriptional regulatory regions</article-title>
<source>J Mol Biol</source>
<year>1996</year>
<volume>255</volume>
<fpage>425</fpage>
<lpage>434</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.1996.0035</pub-id>
</citation>
<citation citation-type="display-unstructured">Benham CJ (1996) Duplex destabilization in superhelical DNA is predicted to occur at specific transcriptional regulatory regions. J Mol Biol 255:425–434
<pub-id pub-id-type="pmid">8568887</pub-id>
</citation>
</ref>
<ref id="CR9">
<citation citation-type="other">Bergmann S, Ihmels J, Barkai N (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E 67:031902</citation>
</ref>
<ref id="CR10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birnbaum</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Benfey</surname>
<given-names>PN</given-names>
</name>
<name>
<surname>Shasha</surname>
<given-names>DE</given-names>
</name>
</person-group>
<article-title>cis Element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships</article-title>
<source>Genome Res</source>
<year>2001</year>
<volume>11</volume>
<fpage>1567</fpage>
<lpage>1573</lpage>
<pub-id pub-id-type="doi">10.1101/gr.158301</pub-id>
</citation>
<citation citation-type="display-unstructured">Birnbaum K, Benfey PN, Shasha DE (2001) cis Element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships. Genome Res 11:1567–1573
<pub-id pub-id-type="pmid">11544201</pub-id>
</citation>
</ref>
<ref id="CR11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Andrews</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Clarke</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Coates</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Cox</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Curwen</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Cutts</surname>
<given-names>T</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ensembl 2006</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>D556</fpage>
<lpage>D561</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkj133</pub-id>
</citation>
<citation citation-type="display-unstructured">Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T et al (2006) Ensembl 2006. Nucleic Acids Res 34:D556–D561
<pub-id pub-id-type="pmid">16381931</pub-id>
</citation>
</ref>
<ref id="CR12">
<citation citation-type="other">Bishop C (1995) Neural networks for pattern recognition. Oxford University Press</citation>
</ref>
<ref id="CR13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breslauer</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Blocker</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Marky</surname>
<given-names>LA</given-names>
</name>
</person-group>
<article-title>Predicting DNA duplex stability from the base sequence</article-title>
<source>PNAS</source>
<year>1986</year>
<volume>83</volume>
<fpage>3746</fpage>
<lpage>3750</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.83.11.3746</pub-id>
</citation>
<citation citation-type="display-unstructured">Breslauer KJ, Frank R, Blocker H, Marky LA (1986) Predicting DNA duplex stability from the base sequence. PNAS 83:3746–3750
<pub-id pub-id-type="pmid">3459152</pub-id>
</citation>
</ref>
<ref id="CR14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bussemaker</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Siggia</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Regulatory element detection using correlation with expression</article-title>
<source>Nat Genet</source>
<year>2001</year>
<volume>27</volume>
<fpage>167</fpage>
<lpage>171</lpage>
<pub-id pub-id-type="doi">10.1038/84792</pub-id>
</citation>
<citation citation-type="display-unstructured">Bussemaker H, Li H, Siggia E (2001) Regulatory element detection using correlation with expression. Nat Genet 27:167–171
<pub-id pub-id-type="pmid">11175784</pub-id>
</citation>
</ref>
<ref id="CR15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choi</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Kalosakas</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rasmussen</surname>
<given-names>KO</given-names>
</name>
<name>
<surname>Hiromura</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bishop</surname>
<given-names>AR</given-names>
</name>
<name>
<surname>Usheva</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>DNA dynamically directs its own transcription initiation</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<fpage>1584</fpage>
<lpage>1590</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh335</pub-id>
</citation>
<citation citation-type="display-unstructured">Choi CH, Kalosakas G, Rasmussen KO, Hiromura M, Bishop AR, Usheva A (2004) DNA dynamically directs its own transcription initiation. Nucleic Acids Res 32:1584–1590
<pub-id pub-id-type="pmid">15004245</pub-id>
</citation>
</ref>
<ref id="CR16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Christie</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Weng</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Balakrishnan</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Costanzo</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Dolinski</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dwight</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Engel</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Feierbach</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fisk</surname>
<given-names>DG</given-names>
</name>
<name>
<surname>Hirschman</surname>
<given-names>JE</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from
<italic>Saccharomyces cerevisiae</italic>
and related sequences from other organisms</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<fpage>D311</fpage>
<lpage>D314</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh033</pub-id>
</citation>
<citation citation-type="display-unstructured">Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE et al (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 32:D311–D314
<pub-id pub-id-type="pmid">14681421</pub-id>
</citation>
</ref>
<ref id="CR17">
<citation citation-type="other">Cliften PF et al [
<ext-link ext-link-type="uri" xlink:href="http://www.genetics.wustl.edu/saccharomycesgenomes/">http://www.genetics.wustl.edu/saccharomycesgenomes/</ext-link>
]. 2003a</citation>
</ref>
<ref id="CR18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cliften</surname>
<given-names>PF</given-names>
</name>
<name>
<surname>Johnston</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Finding functional features in Saccharomyces genomes by phylogenetic footprinting</article-title>
<source>Science</source>
<year>2003b</year>
<volume>301</volume>
<fpage>71</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="doi">10.1126/science.1084337</pub-id>
</citation>
<citation citation-type="display-unstructured">Cliften PF, Johnston M et al (2003b) Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76
<pub-id pub-id-type="pmid">12775844</pub-id>
</citation>
</ref>
<ref id="CR19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Conlon</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>XS</given-names>
</name>
<name>
<surname>Lieb</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>JS</given-names>
</name>
</person-group>
<article-title>Integrating regulatory motif discovery and genome-wide expression analysis</article-title>
<source>PNAS</source>
<year>2003</year>
<volume>100</volume>
<fpage>3339</fpage>
<lpage>3344</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0630591100</pub-id>
</citation>
<citation citation-type="display-unstructured">Conlon EM, Liu XS, Lieb JD, Liu JS (2003) Integrating regulatory motif discovery and genome-wide expression analysis. PNAS 100:3339–3344
<pub-id pub-id-type="pmid">12626739</pub-id>
</citation>
</ref>
<ref id="CR20">
<citation citation-type="other">Cora D, Di Cunto F, Provero P, Silengo L, Caselle M (2004) Computational identification of transcription factor binding sites by functional analysis of sets of genes sharing overrepresented upstream motifs. BMC Bioinformatics 5:57</citation>
</ref>
<ref id="CR21">
<citation citation-type="other">Dean R (2005) Fungal Genomics Laboratory at North Carolina State University and the Broad Institute: Magnaporthe Sequencing Project: [
<ext-link ext-link-type="uri" xlink:href="http://www.fungalgenomics.ncsu.edu">http://www.fungalgenomics.ncsu.edu</ext-link>
,
<ext-link ext-link-type="uri" xlink:href="http://www.broad.mit.edu">http://www.broad.mit.edu</ext-link>
]</citation>
</ref>
<ref id="CR22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deminoff</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Santangelo</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>Rap1p requires Gcr1p and Gcr2p homodimers to activate ribosomal protein and glycolytic genes, respectively</article-title>
<source>Genetics</source>
<year>2001</year>
<volume>158</volume>
<fpage>133</fpage>
<lpage>143</lpage>
</citation>
<citation citation-type="display-unstructured">Deminoff SJ, Santangelo GM (2001) Rap1p requires Gcr1p and Gcr2p homodimers to activate ribosomal protein and glycolytic genes, respectively. Genetics 158:133–143
<pub-id pub-id-type="pmid">11333224</pub-id>
</citation>
</ref>
<ref id="CR23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elemento</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Tavazoie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach</article-title>
<source>Genome Biol</source>
<year>2005</year>
<volume>6</volume>
<fpage>R18</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2005-6-2-r18</pub-id>
</citation>
<citation citation-type="display-unstructured">Elemento O, Tavazoie S (2005) Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach. Genome Biol 6:R18
<pub-id pub-id-type="pmid">15693947</pub-id>
</citation>
</ref>
<ref id="CR24">
<citation citation-type="other">Emboss Website: [
<ext-link ext-link-type="uri" xlink:href="http://www.emboss.sourceforge.net/apps/banana.html">http://www.emboss.sourceforge.net/apps/banana.html</ext-link>
]</citation>
</ref>
<ref id="CR25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Flickinger</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>Transcriptional frequency and cell determination</article-title>
<source>J Theor Biol</source>
<year>2005</year>
<volume>232</volume>
<fpage>151</fpage>
<lpage>156</lpage>
<pub-id pub-id-type="doi">10.1016/j.jtbi.2004.05.020</pub-id>
</citation>
<citation citation-type="display-unstructured">Flickinger RA (2005) Transcriptional frequency and cell determination. J Theor Biol 232:151–156
<pub-id pub-id-type="pmid">15530486</pub-id>
</citation>
</ref>
<ref id="CR26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<name>
<surname>Cristianini</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Duffy</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bednarski</surname>
<given-names>DW</given-names>
</name>
<name>
<surname>Schummer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Support vector machine classification and validation of cancer tissue samples using microarray expression data</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>906</fpage>
<lpage>914</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.10.906</pub-id>
</citation>
<citation citation-type="display-unstructured">Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16:906–914
<pub-id pub-id-type="pmid">11120680</pub-id>
</citation>
</ref>
<ref id="CR27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Galagan</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Calvo</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Borkovich</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Selker</surname>
<given-names>EU</given-names>
</name>
<name>
<surname>Read</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Jaffe</surname>
<given-names>D</given-names>
</name>
<name>
<surname>FitzHugh</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>L-J</given-names>
</name>
<name>
<surname>Smirnov</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Purcell</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The genome sequence of the filamentous fungus Neurospora crassa</article-title>
<source>Nature</source>
<year>2003</year>
<volume>422</volume>
<fpage>859</fpage>
<lpage>868</lpage>
<pub-id pub-id-type="doi">10.1038/nature01554</pub-id>
</citation>
<citation citation-type="display-unstructured">Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma L-J, Smirnov S, Purcell S et al (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422:859–868
<pub-id pub-id-type="pmid">12712197</pub-id>
</citation>
</ref>
<ref id="CR28">
<citation citation-type="other">Gao F, Foat B, Bussemaker H (2004) Defining transcriptional networks through integrative modelling of mRNA expression and transcription factor binding data. BMC Bioinformatics 5:31</citation>
</ref>
<ref id="CR29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gasch</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Moses</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Chiang</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Fraser</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Berardini</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Conservation and evolution of cis-regulatory systems in ascomycete fungi</article-title>
<source>PLOS Biol</source>
<year>2004</year>
<volume>2</volume>
<fpage>2202</fpage>
<lpage>2219</lpage>
<pub-id pub-id-type="doi">10.1371/journal.pbio.0020398</pub-id>
</citation>
<citation citation-type="display-unstructured">Gasch A, Moses A, Chiang D, Fraser H, Berardini M, Eisen M (2004) Conservation and evolution of cis-regulatory systems in ascomycete fungi. PLOS Biol 2:2202–2219 </citation>
</ref>
<ref id="CR30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goodsell</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Dickerson</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Bending and curvature calculations in B-DNA</article-title>
<source>Nucleic Acids Res</source>
<year>1994</year>
<volume>22</volume>
<fpage>5497</fpage>
<lpage>5503</lpage>
<pub-id pub-id-type="doi">10.1093/nar/22.24.5497</pub-id>
</citation>
<citation citation-type="display-unstructured">Goodsell D, Dickerson R (1994) Bending and curvature calculations in B-DNA. Nucleic Acids Res 22:5497–5503
<pub-id pub-id-type="pmid">7816643</pub-id>
</citation>
</ref>
<ref id="CR31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guyon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Barnhill</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Gene selection for cancer classification using support vector machines</article-title>
<source>Mach Learn</source>
<year>2002</year>
<volume>46</volume>
<fpage>389</fpage>
<lpage>422</lpage>
<pub-id pub-id-type="doi">10.1023/A:1012487302797</pub-id>
</citation>
<citation citation-type="display-unstructured">Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422 </citation>
</ref>
<ref id="CR32">
<citation citation-type="other">Harbison C, Fraenkel E, Young R (2005) Web site: [
<ext-link ext-link-type="uri" xlink:href="http://www.jura.wi.mit.edu/fraenkel/download/release_v24/final_set/Final_InTableS2_v24.motifs">http://www.jura.wi.mit.edu/fraenkel/download/release_v24/final_set/Final_InTableS2_v24.motifs</ext-link>
]
</citation>
</ref>
<ref id="CR33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harbison</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fraenkel</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Transcriptional regulatory code of a eukaryotic genome</article-title>
<source>Nature</source>
<year>2004</year>
<volume>431</volume>
<fpage>99</fpage>
<lpage>104</lpage>
<pub-id pub-id-type="doi">10.1038/nature02800</pub-id>
</citation>
<citation citation-type="display-unstructured">Harbison C, Fraenkel E, Young R et al (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431:99–104
<pub-id pub-id-type="pmid">15343339</pub-id>
</citation>
</ref>
<ref id="CR34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haverty</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hansen</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Weng</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Computational inference of transcriptional regulatory networks from expression profiling and transcription factor binding site identification</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<fpage>179</fpage>
<lpage>188</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh183</pub-id>
</citation>
<citation citation-type="display-unstructured">Haverty P, Hansen U, Weng Z (2004) Computational inference of transcriptional regulatory networks from expression profiling and transcription factor binding site identification. Nucleic Acids Res 32:179–188
<pub-id pub-id-type="pmid">14704355</pub-id>
</citation>
</ref>
<ref id="CR35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Regulatory sequence analysis tools</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>3593</fpage>
<lpage>3596</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg567</pub-id>
</citation>
<citation citation-type="display-unstructured">van Helden J (2003) Regulatory sequence analysis tools. Nucleic Acids Res 31:3593–3596
<pub-id pub-id-type="pmid">12824373</pub-id>
</citation>
</ref>
<ref id="CR36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Metrics for comparing regulatory sequences on the basis of pattern counts</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>399</fpage>
<lpage>406</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg425</pub-id>
</citation>
<citation citation-type="display-unstructured">van Helden J (2004) Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics 20:399–406
<pub-id pub-id-type="pmid">14764560</pub-id>
</citation>
</ref>
<ref id="CR37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Collado-Vides</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies</article-title>
<source>J Mol Biol</source>
<year>1998</year>
<volume>281</volume>
<fpage>827</fpage>
<lpage>842</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.1998.1947</pub-id>
</citation>
<citation citation-type="display-unstructured">van Helden J, Collado-Vides J (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281:827–842
<pub-id pub-id-type="pmid">9719638</pub-id>
</citation>
</ref>
<ref id="CR38">
<citation citation-type="other">Hinnebusch A (1992) General and pathway-specific regulatory mechanisms controlling the synthesis of amino acid biosynthetic enzymes in Saccharomyces cerevisiae. In: Broach JR, Jones EW, Pringle JR (eds) The molecular and cellular biology of the yeast Saccharomyces: gene expression. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, pp 319–414</citation>
</ref>
<ref id="CR39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hinnebusch</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Natarajan</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Gcn4p, a master regulator of gene expression, is controlled at multiple levels by diverse signals of starvation and stress</article-title>
<source>Eukaryot Cell</source>
<year>2002</year>
<volume>1</volume>
<fpage>22</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1128/EC.01.1.22-32.2002</pub-id>
</citation>
<citation citation-type="display-unstructured">Hinnebusch AG, Natarajan K (2002) Gcn4p, a master regulator of gene expression, is controlled at multiple levels by diverse signals of starvation and stress. Eukaryot Cell 1:22–32
<pub-id pub-id-type="pmid">12455968</pub-id>
</citation>
</ref>
<ref id="CR40">
<citation citation-type="other">Holloway D, Kon M, DeLisi C (2006) Machine learning methods for transcription data integration. IBM J Res Develop Syst Biol 50: (in press)</citation>
</ref>
<ref id="CR41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hua</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach</article-title>
<source>J Mol Biol</source>
<year>2001a</year>
<volume>308</volume>
<fpage>397</fpage>
<lpage>407</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.2001.4580</pub-id>
</citation>
<citation citation-type="display-unstructured">Hua S, Sun Z (2001a) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 308:397–407
<pub-id pub-id-type="pmid">11327775</pub-id>
</citation>
</ref>
<ref id="CR42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hua</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Support vector machine approach for protein subcellular localization prediction</article-title>
<source>Bioinformatics</source>
<year>2001b</year>
<volume>17</volume>
<fpage>721</fpage>
<lpage>728</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.8.721</pub-id>
</citation>
<citation citation-type="display-unstructured">Hua S, Sun Z (2001b) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17:721–728
<pub-id pub-id-type="pmid">11524373</pub-id>
</citation>
</ref>
<ref id="CR43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ihmels</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Barkai</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Revealing modular organization in the yeast transcriptional network</article-title>
<source>Nat Genet</source>
<year>2002</year>
<volume>31</volume>
<fpage>370</fpage>
<lpage>377</lpage>
</citation>
<citation citation-type="display-unstructured">Ihmels J, Barkai N et al (2002) Revealing modular organization in the yeast transcriptional network. Nat Genet 31:370–377
<pub-id pub-id-type="pmid">12134151</pub-id>
</citation>
</ref>
<ref id="CR44">
<citation citation-type="other">Ihmels J, Bergman S, Barkai N (2005) Barkai Lab: [
<ext-link ext-link-type="uri" xlink:href="http://www.barkai-serv.weizmann.ac.il/GroupPage/">http://www.barkai-serv.weizmann.ac.il/GroupPage/</ext-link>
]</citation>
</ref>
<ref id="CR45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ihmels</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Barkai</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Defining transcription modules using large-scale gene expression data</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>1993</fpage>
<lpage>2003</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth166</pub-id>
</citation>
<citation citation-type="display-unstructured">Ihmels J, Bergman S, Barkai N (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics 20:1993–2003
<pub-id pub-id-type="pmid">15044247</pub-id>
</citation>
</ref>
<ref id="CR46">
<citation citation-type="other">Jaakkola T, Diekhans M, Haussler D (1999) Using the Fisher kernel method to detect remote protein homologies. In: Proc Int Conf Intell Syst Mol Biol, AAAI Press, pp 149–158</citation>
</ref>
<ref id="CR47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Keles</surname>
<given-names>S</given-names>
</name>
<name>
<surname>van der Laan</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Vulpe</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Regulatory motif finding by logic regression</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>2799</fpage>
<lpage>2811</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth333</pub-id>
</citation>
<citation citation-type="display-unstructured">Keles S, van der Laan MJ, Vulpe C (2004) Regulatory motif finding by logic regression. Bioinformatics 20:2799–2811
<pub-id pub-id-type="pmid">15166027</pub-id>
</citation>
</ref>
<ref id="CR48">
<citation citation-type="other">Kellis M Website: [
<ext-link ext-link-type="uri" xlink:href="http://www.broad.mit.edu/annotation/fungi/comp_yeasts/">http://www.broad.mit.edu/annotation/fungi/comp_yeasts/</ext-link>
], 2003</citation>
</ref>
<ref id="CR49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kellis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Endrizzi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Birren</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
</person-group>
<article-title>Sequencing and comparison of yeast species to identify genes and regulatory elements</article-title>
<source>Nature</source>
<year>2003</year>
<volume>423</volume>
<fpage>241</fpage>
<lpage>254</lpage>
<pub-id pub-id-type="doi">10.1038/nature01644</pub-id>
</citation>
<citation citation-type="display-unstructured">Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254
<pub-id pub-id-type="pmid">12748633</pub-id>
</citation>
</ref>
<ref id="CR50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>NA</given-names>
</name>
<name>
<surname>Eibert</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Mellor</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Cbf1p is required for chromatin remodelling at promoter-proximal CACGTG motifs in yeast</article-title>
<source>J Biol Chem</source>
<year>2004</year>
<volume>279</volume>
<fpage>27116</fpage>
<lpage>27123</lpage>
<pub-id pub-id-type="doi">10.1074/jbc.M403818200</pub-id>
</citation>
<citation citation-type="display-unstructured">Kent NA, Eibert SM, Mellor J (2004) Cbf1p is required for chromatin remodelling at promoter-proximal CACGTG motifs in yeast. J Biol Chem 279:27116–27123
<pub-id pub-id-type="pmid">15111622</pub-id>
</citation>
</ref>
<ref id="CR51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lanckriet</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Cristianini</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
</person-group>
<article-title>A statistical framework for genomic data fusion</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>2626</fpage>
<lpage>2635</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth294</pub-id>
</citation>
<citation citation-type="display-unstructured">Lanckriet G, Cristianini N, Jordan M, Noble WS (2004) A statistical framework for genomic data fusion. Bioinformatics 20:2626–2635
<pub-id pub-id-type="pmid">15130933</pub-id>
</citation>
</ref>
<ref id="CR52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>TI</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Transcriptional regulatory networks in Saccharomyces cerevisiae</article-title>
<source>Science</source>
<year>2002</year>
<volume>298</volume>
<fpage>799</fpage>
<lpage>804</lpage>
<pub-id pub-id-type="doi">10.1126/science.1075090</pub-id>
</citation>
<citation citation-type="display-unstructured">Lee TI et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804
<pub-id pub-id-type="pmid">12399584</pub-id>
</citation>
</ref>
<ref id="CR53">
<citation citation-type="other">Leslie C, Kuang R (2003) Fast kernels for inexact string matching. In: 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop Proceedings, pp 114–128</citation>
</ref>
<ref id="CR54">
<citation citation-type="other">Leslie C, Eskin E, Noble WS (2002) The Spectrum Kernel: a string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pp 564–575</citation>
</ref>
<ref id="CR55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leslie</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
</person-group>
<article-title>Mismatch string kernels for discriminative protein classification</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>467</fpage>
<lpage>476</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg431</pub-id>
</citation>
<citation citation-type="display-unstructured">Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20:467–476
<pub-id pub-id-type="pmid">14990442</pub-id>
</citation>
</ref>
<ref id="CR56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mangan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zaslaver</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Alon</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks</article-title>
<source>J Mol Biol</source>
<year>2003</year>
<volume>334</volume>
<fpage>197</fpage>
<lpage>204</lpage>
<pub-id pub-id-type="doi">10.1016/j.jmb.2003.09.049</pub-id>
</citation>
<citation citation-type="display-unstructured">Mangan S, Zaslaver A, Alon U (2003) The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol 334:197–204
<pub-id pub-id-type="pmid">14607112</pub-id>
</citation>
</ref>
<ref id="CR57">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Martin</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Soulard</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>MN</given-names>
</name>
</person-group>
<article-title>TOR regulates ribosomal protein gene expression via PKA and the forkhead transcription factor FHL1</article-title>
<source>Cell</source>
<year>2004</year>
<volume>119</volume>
<fpage>969</fpage>
<lpage>979</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2004.11.047</pub-id>
</citation>
<citation citation-type="display-unstructured">Martin DE, Soulard A, Hall MN (2004) TOR regulates ribosomal protein gene expression via PKA and the forkhead transcription factor FHL1. Cell 119:969–979
<pub-id pub-id-type="pmid">15620355</pub-id>
</citation>
</ref>
<ref id="CR58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Masters</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Parkhurst</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Daugherty</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Parkhurst</surname>
<given-names>LJ</given-names>
</name>
</person-group>
<article-title>Native human TATA-binding protein simultaneously binds and bends promoter DNA without a slow isomerization step or TFIIB requirement</article-title>
<source>J Biol Chem</source>
<year>2003</year>
<volume>278</volume>
<fpage>31685</fpage>
<lpage>31690</lpage>
<pub-id pub-id-type="doi">10.1074/jbc.M305201200</pub-id>
</citation>
<citation citation-type="display-unstructured">Masters KM, Parkhurst KM, Daugherty MA, Parkhurst LJ (2003) Native human TATA-binding protein simultaneously binds and bends promoter DNA without a slow isomerization step or TFIIB requirement. J Biol Chem 278:31685–31690
<pub-id pub-id-type="pmid">12791683</pub-id>
</citation>
</ref>
<ref id="CR59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matys</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kel-Margoulis</surname>
<given-names>OV</given-names>
</name>
<name>
<surname>Fricke</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Liebich</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Land</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Barre-Dirrie</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Reuter</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Chekmenev</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Krull</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hornischer</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>34</volume>
<fpage>D108</fpage>
<lpage>D110</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkj143</pub-id>
</citation>
<citation citation-type="display-unstructured">Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K et al (2005) TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34:D108–D110
<pub-id pub-id-type="pmid">16381825</pub-id>
</citation>
</ref>
<ref id="CR60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mellor</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>DeLisi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Constructing networks with correlation maximization methods</article-title>
<source>Genome Informatics</source>
<year>2004</year>
<volume>15</volume>
<fpage>149</fpage>
<lpage>159</lpage>
</citation>
<citation citation-type="display-unstructured">Mellor J, Wu J, DeLisi C (2004) Constructing networks with correlation maximization methods. Genome Informatics 15:149–159
<pub-id pub-id-type="pmid">15712118</pub-id>
</citation>
</ref>
<ref id="CR61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mountain</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Bystrom</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Korch</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>The general amino acid control regulates MET4, which encodes a methionine-pathway-specific transcriptional activator of Saccharomyces cerevisiae</article-title>
<source>Mol Microbiol</source>
<year>1993</year>
<volume>9</volume>
<fpage>221</fpage>
<lpage>223</lpage>
<pub-id pub-id-type="doi">10.1111/j.1365-2958.1993.tb01684.x</pub-id>
</citation>
<citation citation-type="display-unstructured">Mountain H, Bystrom A, Korch C (1993) The general amino acid control regulates MET4, which encodes a methionine-pathway-specific transcriptional activator of Saccharomyces cerevisiae. Mol Microbiol 9:221–223
<pub-id pub-id-type="pmid">8412668</pub-id>
</citation>
</ref>
<ref id="CR62">
<citation citation-type="other">Parker S, Greenbaum J, Benson G, Tullius TD (2005) Structure-based DNA sequence alignment. In: poster: 5th International Workshop in Bioinformatics and Systems Biology</citation>
</ref>
<ref id="CR63">
<citation citation-type="other">Pavlidis P, Noble WS (2001) Gene functional classification from heterogeneous data. In: RECOMB Conference Proceedings, pp 249–255</citation>
</ref>
<ref id="CR64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pavlidis</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Wapinski</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
</person-group>
<article-title>Support vector machine classification on the web</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>586</fpage>
<lpage>587</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg461</pub-id>
</citation>
<citation citation-type="display-unstructured">Pavlidis P, Wapinski I, Noble WS (2004) Support vector machine classification on the web. Bioinformatics 20:586–587
<pub-id pub-id-type="pmid">14990457</pub-id>
</citation>
</ref>
<ref id="CR65">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pina</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fernandez-Larrea</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Garcia-Reyero</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Idrissi</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>The different (sur)faces of Rap1p</article-title>
<source>Mol Genet Genomics</source>
<year>2003</year>
<volume>268</volume>
<fpage>791</fpage>
<lpage>798</lpage>
</citation>
<citation citation-type="display-unstructured">Pina B, Fernandez-Larrea J, Garcia-Reyero N, Idrissi F (2003) The different (sur)faces of Rap1p. Mol Genet Genomics 268:791–798
<pub-id pub-id-type="pmid">12655405</pub-id>
</citation>
</ref>
<ref id="CR66">
<citation citation-type="other">Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola A, Bartlett P, Scholkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, pp 61–74</citation>
</ref>
<ref id="CR67">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pritsker</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y-C</given-names>
</name>
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Tavazoie</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Whole-genome discovery of transcription factor binding sites by network-level conservation</article-title>
<source>Genome Res</source>
<year>2004</year>
<volume>14</volume>
<fpage>99</fpage>
<lpage>108</lpage>
<pub-id pub-id-type="doi">10.1101/gr.1739204</pub-id>
</citation>
<citation citation-type="display-unstructured">Pritsker M, Liu Y-C, Beer MA, Tavazoie S (2004) Whole-genome discovery of transcription factor binding sites by network-level conservation. Genome Res 14:99–108
<pub-id pub-id-type="pmid">14672978</pub-id>
</citation>
</ref>
<ref id="CR68">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qian</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Luscombe</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1917</fpage>
<lpage>1926</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg347</pub-id>
</citation>
<citation citation-type="display-unstructured">Qian J, Lin J, Luscombe NM, Yu H, Gerstein M (2003) Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics 19:1917–1926
<pub-id pub-id-type="pmid">14555624</pub-id>
</citation>
</ref>
<ref id="CR69">
<citation citation-type="other">Rice P, Longden I, Bleasby A (2000) EMBOSS: the European molecular biology open software suite. Trends Genet 16:276–277</citation>
</ref>
<ref id="CR70">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Satchwell</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Drew</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Travers</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Sequence periodicities in chicken nucleosome core DNA</article-title>
<source>J Mol Biol</source>
<year>1986</year>
<volume>191</volume>
<fpage>659</fpage>
<lpage>675</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(86)90452-3</pub-id>
</citation>
<citation citation-type="display-unstructured">Satchwell S, Drew H, Travers A (1986) Sequence periodicities in chicken nucleosome core DNA. J Mol Biol 191:659–675
<pub-id pub-id-type="pmid">3806678</pub-id>
</citation>
</ref>
<ref id="CR71">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schneider</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Sequence logos: a new way to display consensus sequences</article-title>
<source>Nucleic Acids Res</source>
<year>1990</year>
<volume>18</volume>
<fpage>6097</fpage>
<lpage>6100</lpage>
<pub-id pub-id-type="doi">10.1093/nar/18.20.6097</pub-id>
</citation>
<citation citation-type="display-unstructured">Schneider T, Stephens R (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18:6097–6100
<pub-id pub-id-type="pmid">2172928</pub-id>
</citation>
</ref>
<ref id="CR72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schneider</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Gold</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ehrenfeucht</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Information content of binding sites on nucleotide sequences</article-title>
<source>J Mol Biol</source>
<year>1986</year>
<volume>188</volume>
<fpage>415</fpage>
<lpage>431</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(86)90165-8</pub-id>
</citation>
<citation citation-type="display-unstructured">Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986) Information content of binding sites on nucleotide sequences. J Mol Biol 188:415–431
<pub-id pub-id-type="pmid">3525846</pub-id>
</citation>
</ref>
<ref id="CR73">
<citation citation-type="other">Scholkopf B, Smola AJ (2002) Learning with Kernels. MIT Press, Cambridge</citation>
</ref>
<ref id="CR74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simonis</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Wodak</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>GN</given-names>
</name>
<name>
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Combining pattern discovery and discriminant analysis to predict gene co-regulation</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>2370</fpage>
<lpage>2379</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth252</pub-id>
</citation>
<citation citation-type="display-unstructured">Simonis N, Wodak SJ, Cohen GN, van Helden J (2004) Combining pattern discovery and discriminant analysis to predict gene co-regulation. Bioinformatics 20:2370–2379
<pub-id pub-id-type="pmid">15073004</pub-id>
</citation>
</ref>
<ref id="CR75">
<citation citation-type="other">Smit A, Hubley R, Green P (2005) RepeatMasker Open 3.0: [
<ext-link ext-link-type="uri" xlink:href="http://www.repeatmasker.org">http://www.repeatmasker.org</ext-link>
]</citation>
</ref>
<ref id="CR77">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>DNA binding sites: representation and discovery</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>16</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.1.16</pub-id>
</citation>
<citation citation-type="display-unstructured">Stormo GD (2000) DNA binding sites: representation and discovery. Bioinformatics 16:16–23
<pub-id pub-id-type="pmid">10812473</pub-id>
</citation>
</ref>
<ref id="CR78">
<citation citation-type="other">Tan P-N, Steinbach M, Kumar V (2005) Introduction to data mining. Pearson Education</citation>
</ref>
<ref id="CR79">
<citation citation-type="other">Tatusov RL, Lipman DJ (2005) dust. NCBI Toolkit: [
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>
]</citation>
</ref>
<ref id="CR80">
<citation citation-type="other">The Mathworks: [
<ext-link ext-link-type="uri" xlink:href="http://www.mathworks.com/">http://www.mathworks.com/</ext-link>
]</citation>
</ref>
<ref id="CR81">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites</article-title>
<source>Nat Biotechnol</source>
<year>2005</year>
<volume>23</volume>
<fpage>137</fpage>
<lpage>144</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1053</pub-id>
</citation>
<citation citation-type="display-unstructured">Tompa M et al (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23:137–144
<pub-id pub-id-type="pmid">15637633</pub-id>
</citation>
</ref>
<ref id="CR82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tullius</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Greenbaum</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>Mapping nucleic acid structure by hydroxyl radical cleavage</article-title>
<source>Curr Opin Chem Biol</source>
<year>2005</year>
<volume>9</volume>
<fpage>127</fpage>
<lpage>134</lpage>
<pub-id pub-id-type="doi">10.1016/j.cbpa.2005.02.009</pub-id>
</citation>
<citation citation-type="display-unstructured">Tullius TD, Greenbaum JA (2005) Mapping nucleic acid structure by hydroxyl radical cleavage. Curr Opin Chem Biol 9:127–134
<pub-id pub-id-type="pmid">15811796</pub-id>
</citation>
</ref>
<ref id="CR83">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Cherry</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Botstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae</article-title>
<source>PNAS</source>
<year>2002</year>
<volume>99</volume>
<fpage>16893</fpage>
<lpage>16898</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.252638199</pub-id>
</citation>
<citation citation-type="display-unstructured">Wang W, Cherry JM, Botstein D, Li H (2002) A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae. PNAS 99:16893–16898
<pub-id pub-id-type="pmid">12482955</pub-id>
</citation>
</ref>
<ref id="CR84">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>K-C</given-names>
</name>
</person-group>
<article-title>Using string kernel to predict signal peptide cleavage site based on subsite coupling model</article-title>
<source>Amino Acids</source>
<year>2005</year>
<volume>28</volume>
<fpage>395</fpage>
<lpage>402</lpage>
<pub-id pub-id-type="doi">10.1007/s00726-005-0189-6</pub-id>
</citation>
<citation citation-type="display-unstructured">Wang M, Yang J, Chou K-C (2005) Using string kernel to predict signal peptide cleavage site based on subsite coupling model. Amino Acids 28:395–402
<pub-id pub-id-type="pmid">15838592</pub-id>
</citation>
</ref>
<ref id="CR85">
<citation citation-type="other">Weston J, Elisseeff A, Bakir G, Sinz F et al (2005) SPIDER: object oriented machine learning library version 6: [
<ext-link ext-link-type="uri" xlink:href="http://www.kyb.tuebingen.mpg.de/bs/people/spider/">http://www.kyb.tuebingen.mpg.de/bs/people/spider/</ext-link>
]</citation>
</ref>
<ref id="CR86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wheeler</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Barrett</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>SH</given-names>
</name>
<name>
<surname>Canese</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>DiCuccio</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Edgar</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Federhen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Helmberg</surname>
<given-names>W</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Database resources of the National Center for Biotechnology Information</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<fpage>D39</fpage>
<lpage>D45</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gki062</pub-id>
</citation>
<citation citation-type="display-unstructured">Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W et al (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33:D39–D45
<pub-id pub-id-type="pmid">15608222</pub-id>
</citation>
</ref>
<ref id="CR87">
<citation citation-type="other">Workman CT, Stormo GD (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In: Pac Symp Biocomput, pp 467–478</citation>
</ref>
<ref id="CR88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kasif</surname>
<given-names>S</given-names>
</name>
<name>
<surname>DeLisi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Identification of functional links between genes using phylogenetic profiles</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/19.1.1</pub-id>
</citation>
<citation citation-type="display-unstructured">Wu J, Kasif S, DeLisi C (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19:1–7 </citation>
</ref>
<ref id="CR89">
<citation citation-type="other">Young Lab Web Data: [
<ext-link ext-link-type="uri" xlink:href="http://www.staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=evidence">http://www.staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=evidence</ext-link>
]</citation>
</ref>
<ref id="CR90">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Luscombe</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Genomic analysis of gene expression relationships in transcriptional regulatory networks</article-title>
<source>Trends Genet</source>
<year>2003</year>
<volume>19</volume>
<fpage>422</fpage>
<lpage>427</lpage>
<pub-id pub-id-type="doi">10.1016/S0168-9525(03)00175-6</pub-id>
</citation>
<citation citation-type="display-unstructured">Yu H, Luscombe N, Qian J, Gerstein M (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 19:422–427
<pub-id pub-id-type="pmid">12902159</pub-id>
</citation>
</ref>
<ref id="CR91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname>
<given-names>X-F</given-names>
</name>
<name>
<surname>Schreiber</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Target of rapamycin proteins and their kinase activities are required for meiosis</article-title>
<source>PNAS</source>
<year>1997</year>
<volume>94</volume>
<fpage>3070</fpage>
<lpage>3075</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.94.7.3070</pub-id>
</citation>
<citation citation-type="display-unstructured">Zheng X-F, Schreiber SL (1997) Target of rapamycin proteins and their kinase activities are required for meiosis. PNAS 94:3070–3075
<pub-id pub-id-type="pmid">9096347</pub-id>
</citation>
</ref>
<ref id="CR92">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Pilpel</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Computational identification of transcription factor binding sites via a transcription-factor-centric-clustering (TFCC) algorithm</article-title>
<source>J Mol Biol</source>
<year>2002</year>
<volume>318</volume>
<fpage>71</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(02)00026-8</pub-id>
</citation>
<citation citation-type="display-unstructured">Zhu Z, Pilpel Y, Church G (2002) Computational identification of transcription factor binding sites via a transcription-factor-centric-clustering (TFCC) algorithm. J Mol Biol 318:71–81
<pub-id pub-id-type="pmid">12054769</pub-id>
</citation>
</ref>
<ref id="CR93">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zien</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ratsch</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mika</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Scholkopf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Muller</surname>
<given-names>K-R</given-names>
</name>
</person-group>
<article-title>Engineering support vector machine kernels that recognize translation initiation sites</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>799</fpage>
<lpage>807</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.9.799</pub-id>
</citation>
<citation citation-type="display-unstructured">Zien A, Ratsch G, Mika S, Scholkopf B, Lengauer T, Muller K-R (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16:799–807
<pub-id pub-id-type="pmid">11108702</pub-id>
</citation>
</ref>
<ref id="CR94">
<citation citation-type="other">Zubay G (1996) Biochemistry, 4th edn. Columbia University, WCB Publishers, pp 297–335</citation>
</ref>
</ref-list>
<fn-group>
<fn>
<p>
<bold>Electronic Supplementary Material</bold>
</p>
<p>Supplementary material is available in the online version of this article at
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1007/s11693-006-9003-3">http://dx.doi.org/10.1007/s11693-006-9003-3</ext-link>
and is accessible for authorized users.</p>
</fn>
<fn>
<p>
<bold>Authors' contributions</bold>
</p>
<p>DH coded the required software in Matlab and Perl, conceived of many of the design implementations, and wrote this article. All authors made contributions to this manuscript and developed the experimental design. CD initially conceived and motivated this work. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
</back>
</pmc>
</record>
