Predicting the binding preference of transcription factors to individual DNA k-mers

Authors: Trevis M. Alleyne; Lourdes Peña-Castillo; Gwenael Badis; Shaheynoor Talukder; Michael F. Berger; Andrew R. Gehrke; Anthony A. Philippakis; Martha L. Bulyk; Quaid D. Morris; Timothy R. Hughes

Source:

Bioinformatics [ 1367-4803 ] ; 2008.

RBID : PMC:2666811

Abstract

Motivation: Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA–protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.

Results: We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF–DNA recognition, and suggest a rational approach for future analyses of TF families.

Contact: t.hughes@utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2666811

DOI: 10.1093/bioinformatics/btn645
PubMed: 19088121
PubMed Central: 2666811

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Predicting the binding preference of transcription factors to individual DNA <italic>k</italic>
-mers</title>
<author><name sortKey="Alleyne, Trevis M" sort="Alleyne, Trevis M" uniqKey="Alleyne T" first="Trevis M." last="Alleyne">Trevis M. Alleyne</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pe A Castillo, Lourdes" sort="Pe A Castillo, Lourdes" uniqKey="Pe A Castillo L" first="Lourdes" last="Peña-Castillo">Lourdes Peña-Castillo</name>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Badis, Gwenael" sort="Badis, Gwenael" uniqKey="Badis G" first="Gwenael" last="Badis">Gwenael Badis</name>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Talukder, Shaheynoor" sort="Talukder, Shaheynoor" uniqKey="Talukder S" first="Shaheynoor" last="Talukder">Shaheynoor Talukder</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Berger, Michael F" sort="Berger, Michael F" uniqKey="Berger M" first="Michael F." last="Berger">Michael F. Berger</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Gehrke, Andrew R" sort="Gehrke, Andrew R" uniqKey="Gehrke A" first="Andrew R." last="Gehrke">Andrew R. Gehrke</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Philippakis, Anthony A" sort="Philippakis, Anthony A" uniqKey="Philippakis A" first="Anthony A." last="Philippakis">Anthony A. Philippakis</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
<affiliation><nlm:aff wicri:cut=" and" id="AFF1">Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bulyk, Martha L" sort="Bulyk, Martha L" uniqKey="Bulyk M" first="Martha L." last="Bulyk">Martha L. Bulyk</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
<affiliation><nlm:aff wicri:cut=" and" id="AFF1">Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Morris, Quaid D" sort="Morris, Quaid D" uniqKey="Morris Q" first="Quaid D." last="Morris">Quaid D. Morris</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hughes, Timothy R" sort="Hughes, Timothy R" uniqKey="Hughes T" first="Timothy R." last="Hughes">Timothy R. Hughes</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">19088121</idno>
<idno type="pmc">2666811</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2666811</idno>
<idno type="RBID">PMC:2666811</idno>
<idno type="doi">10.1093/bioinformatics/btn645</idno>
<date when="2008">2008</date>
<idno type="wicri:Area/Pmc/Corpus">000938</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000938</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Predicting the binding preference of transcription factors to individual DNA <italic>k</italic>
-mers</title>
<author><name sortKey="Alleyne, Trevis M" sort="Alleyne, Trevis M" uniqKey="Alleyne T" first="Trevis M." last="Alleyne">Trevis M. Alleyne</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pe A Castillo, Lourdes" sort="Pe A Castillo, Lourdes" uniqKey="Pe A Castillo L" first="Lourdes" last="Peña-Castillo">Lourdes Peña-Castillo</name>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Badis, Gwenael" sort="Badis, Gwenael" uniqKey="Badis G" first="Gwenael" last="Badis">Gwenael Badis</name>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Talukder, Shaheynoor" sort="Talukder, Shaheynoor" uniqKey="Talukder S" first="Shaheynoor" last="Talukder">Shaheynoor Talukder</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Berger, Michael F" sort="Berger, Michael F" uniqKey="Berger M" first="Michael F." last="Berger">Michael F. Berger</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Gehrke, Andrew R" sort="Gehrke, Andrew R" uniqKey="Gehrke A" first="Andrew R." last="Gehrke">Andrew R. Gehrke</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Philippakis, Anthony A" sort="Philippakis, Anthony A" uniqKey="Philippakis A" first="Anthony A." last="Philippakis">Anthony A. Philippakis</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
<affiliation><nlm:aff wicri:cut=" and" id="AFF1">Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bulyk, Martha L" sort="Bulyk, Martha L" uniqKey="Bulyk M" first="Martha L." last="Bulyk">Martha L. Bulyk</name>
<affiliation><nlm:aff id="AFF1">Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,</nlm:aff>
</affiliation>
<affiliation><nlm:aff wicri:cut=" and" id="AFF1">Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Morris, Quaid D" sort="Morris, Quaid D" uniqKey="Morris Q" first="Quaid D." last="Morris">Quaid D. Morris</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hughes, Timothy R" sort="Hughes, Timothy R" uniqKey="Hughes T" first="Timothy R." last="Hughes">Timothy R. Hughes</name>
<affiliation><nlm:aff id="AFF1">Department of Molecular Genetics,</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="AFF1">Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1460-2059</idno>
<imprint><date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p><bold>Motivation:</bold>
 Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible <italic>k</italic>
-mers provide a new opportunity to study DNA–protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.</p>
<p><bold>Results:</bold>
 We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF–DNA recognition, and suggest a rational approach for future analyses of TF families.</p>
<p><bold>Contact:</bold>
 <email>t.hughes@utoronto.ca</email>
</p>
<p><bold>Supplementary information:</bold>
 <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btn645/DC1">Supplementary data</ext-link>
 are available at <italic>Bioinformatics</italic>
 online.</p>
</div>
</front>
</TEI>
<pmc article-type="research-article" xml:lang="EN"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title>Bioinformatics</journal-title>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1460-2059</issn>
<publisher><publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">19088121</article-id>
<article-id pub-id-type="pmc">2666811</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btn645</article-id>
<article-id pub-id-type="publisher-id">btn645</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Original Papers</subject>
<subj-group><subject>Gene Expression</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>Predicting the binding preference of transcription factors to individual DNA <italic>k</italic>
-mers</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Alleyne</surname>
<given-names>Trevis M.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Peña-Castillo</surname>
<given-names>Lourdes</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Badis</surname>
<given-names>Gwenael</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Talukder</surname>
<given-names>Shaheynoor</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Berger</surname>
<given-names>Michael F.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>3</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Gehrke</surname>
<given-names>Andrew R.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Philippakis</surname>
<given-names>Anthony A.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>3</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>4</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Bulyk</surname>
<given-names>Martha L.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>3</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>4</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>5</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>6</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Morris</surname>
<given-names>Quaid D.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Hughes</surname>
<given-names>Timothy R.</given-names>
</name>
<xref ref-type="aff" rid="AFF1"><sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF1"><sup>2</sup>
</xref>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
</contrib-group>
<aff id="AFF1"><sup>1</sup>
Department of Molecular Genetics,<sup>2</sup>
Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M5S 3E1, Canada,<sup>3</sup>
Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115,<sup>4</sup>
Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138,<sup>5</sup>
Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115 and<sup>6</sup>
Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA</aff>
<author-notes><corresp id="COR1">*To whom correspondence should be addressed.</corresp>
<fn><p>Associate Editor: David Rocke</p>
</fn>
</author-notes>
<pub-date pub-type="ppub"><day>15</day>
<month>4</month>
<year>2009</year>
</pub-date>
<pub-date pub-type="epub"><day>16</day>
<month>12</month>
<year>2008</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>16</day>
<month>12</month>
<year>2008</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the 
 			. </pmc-comment>
      <volume>25</volume>
<issue>8</issue>
<fpage>1012</fpage>
<lpage>1018</lpage>
<history><date date-type="received"><day>10</day>
<month>8</month>
<year>2008</year>
</date>
<date date-type="rev-recd"><day>16</day>
<month>11</month>
<year>2008</year>
</date>
<date date-type="accepted"><day>11</day>
<month>12</month>
<year>2008</year>
</date>
</history>
<permissions><copyright-statement>© 2008 The Author(s)</copyright-statement>
<copyright-year>2008</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/2.0/uk/"><p><pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/2.0/uk/">http://creativecommons.org/licenses/by-nc/2.0/uk/</ext-link>
) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</p>
</license>
</permissions>
<abstract><p><bold>Motivation:</bold>
 Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible <italic>k</italic>
-mers provide a new opportunity to study DNA–protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.</p>
<p><bold>Results:</bold>
 We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF–DNA recognition, and suggest a rational approach for future analyses of TF families.</p>
<p><bold>Contact:</bold>
 <email>t.hughes@utoronto.ca</email>
</p>
<p><bold>Supplementary information:</bold>
 <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btn645/DC1">Supplementary data</ext-link>
 are available at <italic>Bioinformatics</italic>
 online.</p>
</abstract>
</article-meta>
</front>
<body><sec sec-type="intro" id="SEC1"><title>1 INTRODUCTION</title>
<p>Most transcription factors (TFs) can be grouped into families of shared conserved DNA-binding structures that are usually identified by common ancestry inferred from sequence homology (Papavassiliou, <xref ref-type="bibr" rid="B22">1995</xref>
). Despite the sequence conservation within TF families, individual proteins within the same DNA-binding domain (DBD) family can have radically different DNA-binding specificities (Ekker <italic>et al.</italic>
, <xref ref-type="bibr" rid="B11">1994</xref>
). Since the preferred binding sequences within a family can often be changed by mutating only a single DNA-contacting amino acid residue (Damante <italic>et al.</italic>
, <xref ref-type="bibr" rid="B10">1996</xref>
), it has been proposed that a recognition code might exist in which affinity to each base in a TF binding site is governed by either additive or combinatorial rules that pair the identities of amino acids at DNA-contacting positions with relative preferences for each of the four DNA bases at each position of the binding site. Conflicting with this view, however, are observations that changes in DBD sequence can alter the arrangement of DNA-contacting residues in ways that seem to be inconsistent with a simple recognition code (Miller <italic>et al.</italic>
, <xref ref-type="bibr" rid="B18">2003</xref>
; Pabo and Nekludova, <xref ref-type="bibr" rid="B21">2000</xref>
). In addition, study of the DNA-binding specificities of TFs typically employs a position weight matrix (PWM) (Stormo, <xref ref-type="bibr" rid="B24">2000</xref>
), and the assumptions of PWMs, such as independence of base positions, do not fit all of the biochemical data (Benos <italic>et al.</italic>
, <xref ref-type="bibr" rid="B4">2002</xref>
).</p>
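<p>The independence assumption mentioned above can be made concrete with a small sketch. The matrix below is a hypothetical four-column example chosen only for illustration (it is not taken from the PBM data), and the function name and the uniform background frequency are likewise assumptions of this sketch.</p>

```python
import math

# Hypothetical 4-column probability matrix; each column sums to 1.
# The values are illustrative only and are not taken from any experiment.
PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.20, "T": 0.70},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def pwm_log_odds(site, pwm=PWM, background=BACKGROUND):
    """Score a site as a sum of independent per-position log-odds terms."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the number of PWM columns")
    return sum(math.log2(col[base] / background) for base, col in zip(site, pwm))
```

<p>Because the score is a plain sum over positions, replacing the last base T with G changes the score by the same amount whatever the other three bases are. This is the additivity that, according to the text, does not fit all of the biochemical data.</p>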
<p>Several high-throughput, unbiased and semi-quantitative methods for the assessment of TF sequence preferences have been developed, including protein-binding microarrays (PBM) (Mukherjee <italic>et al.</italic>
, <xref ref-type="bibr" rid="B20">2004</xref>
), DNA immunoprecipitation microarrays (DIP-chip) (Liu <italic>et al.</italic>
, <xref ref-type="bibr" rid="B17">2005</xref>
), and cognate site identifier (CSI) microarrays (Warren <italic>et al.</italic>
, <xref ref-type="bibr" rid="B27">2006</xref>
). The datasets associated with these methods provide an opportunity to examine protein–DNA interactions at unprecedented resolution and scale. Here, we present an evaluation of how well a panel of inference algorithms can predict TF DNA-binding specificity data derived from PBM experiments, in an effort to gain deeper insight into the mechanisms governing the specificity of protein–DNA interactions, and also to identify a means of projecting binding preferences onto proteins whose binding preferences are unknown. We focus on the homeodomain family because it is large and diverse, and because the vast majority of homeodomain-containing proteins have only a single homeodomain. Homeodomains are also among the best-studied DBDs, both structurally and biochemically, such that the DNA-contacting residues are known (Kissinger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B15">1990</xref>
) and several residues that can alter sequence specificity have been identified (Ades and Sauer, <xref ref-type="bibr" rid="B1">1994</xref>
; Ekker <italic>et al.</italic>
, <xref ref-type="bibr" rid="B11">1994</xref>
; Hanes and Brent, <xref ref-type="bibr" rid="B14">1989</xref>
). We find that a nearest neighbour (NN) approach using TF protein sequences is at least as effective as more sophisticated techniques. This finding has implications for the mechanics of DNA-binding, and for future study of TF–DNA interactions.</p>
</sec>
<sec sec-type="methods" id="SEC2"><title>2 METHODS</title>
<sec id="SEC2.1"><title>2.1 Dataset</title>
<p>The <italic>Z</italic>
-score transformed relative signal intensities for 168 homeodomains across all 32 896 8mer DNA sequences were obtained using PBMs (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
). Because methods that overfit the data may appear to perform well under leave-one-out cross-validation when a large portion of the examples have at least one nearly identical counterpart, we reduced the dataset to 75 homeodomains that are unique at the 15 amino acid positions described as making contact with DNA in the engrailed crystal structure (<xref ref-type="table" rid="T1">Table 1</xref>
). A multiple sequence alignment of the 75 homeodomains was obtained by downloading the primary homeodomain family alignment from Pfam-A (Bateman <italic>et al.</italic>
, <xref ref-type="bibr" rid="B3">2004</xref>
) (Accession number PF00046) and extracting the pertinent sequences. From the resulting sequence alignment, three subset sequence alignments were derived for purposes of feature selection: all 57 residues of the Pfam alignment (positions 2–58 of the engrailed homeodomain), 15 residues described by Kissinger <italic>et al.</italic>
 (<xref ref-type="bibr" rid="B15">1990</xref>
) as making direct contact with DNA in the engrailed homeodomain crystal structure (positions 3, 5, 6, 25, 31, 44, 46, 47, 48, 50, 51, 53, 54, 55 and 57), and six residues described as determinants of sequence specificity in the literature (Ekker <italic>et al.</italic>
, <xref ref-type="bibr" rid="B11">1994</xref>
; Laughon, <xref ref-type="bibr" rid="B16">1991</xref>) (positions 3, 6, 7, 47, 50 and 54).
<table-wrap id="T1" position="float"><label>Table 1.</label>
<caption><p>List of 75 mouse homeodomains unique at the 15 amino acid positions that contact DNA</p>
</caption>
<table frame="hsides" rules="groups"><tbody align="left"><tr><td rowspan="1" colspan="1">Alx3</td>
<td rowspan="1" colspan="1">Dobox4</td>
<td rowspan="1" colspan="1">Hlxb9</td>
<td rowspan="1" colspan="1">Hoxc12</td>
<td rowspan="1" colspan="1">Lhx6</td>
<td rowspan="1" colspan="1">Pax4</td>
<td rowspan="1" colspan="1">Rhox6</td>
</tr>
<tr><td rowspan="1" colspan="1">Bapx1</td>
<td rowspan="1" colspan="1">Dobox5</td>
<td rowspan="1" colspan="1">Hmbox1</td>
<td rowspan="1" colspan="1">Hoxc8</td>
<td rowspan="1" colspan="1">Meis1</td>
<td rowspan="1" colspan="1">Pax6</td>
<td rowspan="1" colspan="1">Six1</td>
</tr>
<tr><td rowspan="1" colspan="1">Barhl1</td>
<td rowspan="1" colspan="1">Duxl</td>
<td rowspan="1" colspan="1">Hmx1</td>
<td rowspan="1" colspan="1">Ipf1</td>
<td rowspan="1" colspan="1">Meox1</td>
<td rowspan="1" colspan="1">Pax7</td>
<td rowspan="1" colspan="1">Six3</td>
</tr>
<tr><td rowspan="1" colspan="1">Barx1</td>
<td rowspan="1" colspan="1">Emx2</td>
<td rowspan="1" colspan="1">Hmx2</td>
<td rowspan="1" colspan="1">Irx2</td>
<td rowspan="1" colspan="1">Msx1</td>
<td rowspan="1" colspan="1">Pbx1</td>
<td rowspan="1" colspan="1">Six4</td>
</tr>
<tr><td rowspan="1" colspan="1">Bsx</td>
<td rowspan="1" colspan="1">En1</td>
<td rowspan="1" colspan="1">Homez</td>
<td rowspan="1" colspan="1">Irx3</td>
<td rowspan="1" colspan="1">Nkx1-1</td>
<td rowspan="1" colspan="1">Pitx1</td>
<td rowspan="1" colspan="1">Tcf1</td>
</tr>
<tr><td rowspan="1" colspan="1">Cdx1</td>
<td rowspan="1" colspan="1">Esx1</td>
<td rowspan="1" colspan="1">Hoxa1</td>
<td rowspan="1" colspan="1">Isl2</td>
<td rowspan="1" colspan="1">Nkx2-2</td>
<td rowspan="1" colspan="1">Pknox1</td>
<td rowspan="1" colspan="1">Tcf2</td>
</tr>
<tr><td rowspan="1" colspan="1">Cphx</td>
<td rowspan="1" colspan="1">Evx1</td>
<td rowspan="1" colspan="1">Hoxa10</td>
<td rowspan="1" colspan="1">Isx</td>
<td rowspan="1" colspan="1">Nkx6-1</td>
<td rowspan="1" colspan="1">Pou1f1</td>
<td rowspan="1" colspan="1">Tgif1</td>
</tr>
<tr><td rowspan="1" colspan="1">Crx</td>
<td rowspan="1" colspan="1">Gsc</td>
<td rowspan="1" colspan="1">Hoxa13</td>
<td rowspan="1" colspan="1">Lbx2</td>
<td rowspan="1" colspan="1">Obox1</td>
<td rowspan="1" colspan="1">Pou2f1</td>
<td rowspan="1" colspan="1">Tgif2</td>
</tr>
<tr><td rowspan="1" colspan="1">Cutl1</td>
<td rowspan="1" colspan="1">Gsh2</td>
<td rowspan="1" colspan="1">Hoxa2</td>
<td rowspan="1" colspan="1">Lhx1</td>
<td rowspan="1" colspan="1">Obox6</td>
<td rowspan="1" colspan="1">Pou4f3</td>
<td rowspan="1" colspan="1">Tlx2</td>
</tr>
<tr><td rowspan="1" colspan="1">Dbx1</td>
<td rowspan="1" colspan="1">Hdx</td>
<td rowspan="1" colspan="1">Hoxa6</td>
<td rowspan="1" colspan="1">Lhx2</td>
<td rowspan="1" colspan="1">Og2x</td>
<td rowspan="1" colspan="1">Pou6f1</td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr><td rowspan="1" colspan="1">Dlx1</td>
<td rowspan="1" colspan="1">Hlx1</td>
<td rowspan="1" colspan="1">Hoxb13</td>
<td rowspan="1" colspan="1">Lhx3</td>
<td rowspan="1" colspan="1">Otp</td>
<td rowspan="1" colspan="1">Rhox11</td>
<td rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
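<p>The filtering step and the 8mer count described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes each aligned sequence is a 57-character string covering engrailed positions 2–58, it keeps the first protein seen for each distinct set of contact residues (the text does not say how representatives were chosen), and all function names are hypothetical. The counting function reproduces the figure of 32 896 on the assumption that an 8mer and its reverse complement are treated as one sequence.</p>

```python
from itertools import product

# Engrailed homeodomain positions listed in the text as contacting DNA.
CONTACT_POSITIONS = (3, 5, 6, 25, 31, 44, 46, 47, 48, 50, 51, 53, 54, 55, 57)
FIRST_ALIGNED_POSITION = 2  # assumed: column 0 of the alignment is position 2


def contact_signature(aligned_seq, positions=CONTACT_POSITIONS):
    """Return the residues found at the given engrailed positions."""
    return "".join(aligned_seq[p - FIRST_ALIGNED_POSITION] for p in positions)


def unique_by_contacts(alignment):
    """Keep one protein name per distinct contact signature.

    alignment maps protein name to aligned sequence. Keeping the first
    protein encountered is an arbitrary choice made for this sketch.
    """
    seen = {}
    for name, seq in alignment.items():
        seen.setdefault(contact_signature(seq), name)
    return sorted(seen.values())


def count_nonredundant_kmers(k):
    """Count DNA k-mers when a k-mer and its reverse complement are merged."""
    complement = str.maketrans("ACGT", "TGCA")
    canonical = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        reverse_complement = kmer.translate(complement)[::-1]
        canonical.add(min(kmer, reverse_complement))
    return len(canonical)
```

<p>For k = 8 the count agrees with the closed form (4^8 + 4^4)/2 = 32 896, where 4^4 is the number of 8mers that equal their own reverse complement.</p>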
<sec id="SEC2.1.1"><title>2.1.1 Numerical encoding</title>
<p>All implementations of the compared methods, except NN, required numerical inputs. We converted the 6-, 15- and 57-position sequence alignments to numerical encodings representing amino acid sequences of length <italic>l</italic>
 as binary vectors of length <italic>l</italic>
×20 digits: each of the 20 amino acids was encoded as one of 20 mutually orthogonal 20-digit binary vectors (a single digit set to 1 and the rest to 0), and an amino acid sequence was represented by concatenating the vectors for the residues at each position. Gaps were encoded as a vector of 20 zeros. Insertions were not considered in this analysis.</p>
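<p>A minimal sketch of this encoding is given below. Which digit is assigned to which amino acid is not stated in the text, so the residue order used here is an arbitrary assumption, as is the function name.</p>

```python
# Assumed residue order; any fixed order of the 20 amino acids would do.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"
GAP = "-"


def one_hot(aligned_seq):
    """Encode an aligned sequence of length l as a list of l x 20 binary digits.

    Each residue becomes a 20-digit block with a single 1; a gap becomes
    20 zeros. Any other character raises ValueError.
    """
    vector = []
    for residue in aligned_seq:
        block = [0] * len(AMINO_ACIDS)
        if residue != GAP:
            block[AMINO_ACIDS.index(residue)] = 1
        vector.extend(block)
    return vector
```

<p>Under this scheme the 6-, 15- and 57-position alignments yield vectors of 120, 300 and 1140 digits, respectively.</p>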
</sec>
</sec>
<sec id="SEC2.2"><title>2.2 Machine learning algorithms</title>
<p>Let <italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>2</sub>
,…, <italic>x</italic>
<sub><italic>n</italic>
</sub>
 be the set of <italic>m</italic>
-residue sequence alignments from the dataset described above, where <italic>m</italic>
=6, 15 or 57, and for a given 8mer out of the <italic>t</italic>
 total exemplar 8mers, let <italic>y</italic>
<sub><italic>i</italic>
</sub>
 be the <italic>Z</italic>
-score for the <italic>i</italic>
-th protein with respect to that 8mer. We defined the problem of predicting homeodomain <italic>Z</italic>
-scores for a particular 8mer as the estimation of the function <italic>f</italic>
 : χ
→ℝ trained using the <italic>n</italic>
 data pairs (<italic>x</italic>
<sub>1</sub>
, <italic>y</italic>
<sub>1</sub>
),…, (<italic>x</italic>
<sub><italic>n</italic>
</sub>
, <italic>y</italic>
<sub><italic>n</italic>
</sub>
)∈χ×ℝ, such that <italic>f</italic>
(<italic>x</italic>
<sub><italic>i</italic>
</sub>
) is approximately equal to <italic>y</italic>
<sub><italic>i</italic>
</sub>
 and <italic>f</italic>
 correctly generalizes to most unseen examples; therefore the problem of predicting homeodomain 8mer <italic>Z</italic>
-score profiles across all 8mers was defined as predicting all <italic>t</italic>
 such functions. In this case, χ
 was the set of sequences {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V, –}<sup><italic>m</italic>
</sup>
, where ‘–’ represents a sequence alignment gap. We formalized both definitions as multiple regression problems in which the <italic>x</italic>
<sub><italic>i</italic>
</sub>
 were considered as <italic>n</italic>
 observations on <italic>m</italic>
 predictor variables and the <italic>y</italic>
<sub><italic>i</italic>
</sub>
 were considered as <italic>n</italic>
 observations on a response variable, and accordingly, compared a number of regression techniques from machine learning and statistics (outlined below) for the purpose of quantitatively modelling the relationships between these variables.</p>
<sec id="SEC2.2.1"><title>2.2.1 Nearest Neighbour</title>
<p>Assume that <italic>x</italic>
 is the length <italic>m</italic>
 amino acid sequence alignment of an unseen protein. In order to predict the 8mer profile of <italic>x</italic>
, our implementation of the NN algorithm calculates a vector <inline-formula><inline-graphic xlink:href="btn645i1.jpg"></inline-graphic>
</inline-formula>
 of distances, where each element <inline-formula><inline-graphic xlink:href="btn645i2.jpg"></inline-graphic>
</inline-formula>
 represents the distance <italic>d</italic>
(<italic>x</italic>
<sub><italic>i</italic>
</sub>
, <italic>x</italic>
) between protein <italic>x</italic>
 and <italic>x</italic>
<sub><italic>i</italic>
</sub>
(<italic>i</italic>
∈{1,…, <italic>n</italic>
}). We defined the distance between two proteins as the proportion of non-identities across all <italic>m</italic>
 positions. We also tested distances based on the PAM250 matrix, but the results were inferior (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
). Using <inline-formula><inline-graphic xlink:href="btn645i3.jpg"></inline-graphic>
</inline-formula>
, the algorithm then finds the NNs of <italic>x</italic>
 by computing the set <inline-formula><inline-graphic xlink:href="btn645i4.jpg"></inline-graphic>
</inline-formula>
. Finally, for all <italic>t</italic>
 8mers, the algorithm calculates the <italic>Z</italic>
-score of each 8mer as the mean of the <italic>Z</italic>
-scores for that 8mer across all of the NNs.</p>
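<p>A minimal sketch of this procedure is given below. It assumes that the NN set contains every training protein at the minimum distance, so that ties are averaged; the exact set definition appears only as an image in the original, and the function names here are hypothetical.</p>

```python
def mismatch_distance(a, b):
    """Proportion of non-identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(1 for p, q in zip(a, b) if p != q) / len(a)


def nn_predict(query, train_seqs, train_profiles):
    """Predict a Z-score profile as the mean profile of all nearest neighbours.

    train_profiles[i] holds the t Z-scores measured for train_seqs[i].
    Assumption: every training protein at the minimum distance is a neighbour.
    """
    distances = [mismatch_distance(s, query) for s in train_seqs]
    d_min = min(distances)
    neighbours = [i for i, d in enumerate(distances) if d == d_min]
    t = len(train_profiles[0])
    return [
        sum(train_profiles[i][k] for i in neighbours) / len(neighbours)
        for k in range(t)
    ]
```

<p>Because the neighbours are found from protein sequence alone, the distance computation is performed once per protein and does not depend on the number of 8mers.</p>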
</sec>
<sec id="SEC2.2.2"><title>2.2.2 Random forests regression</title>
<p>We used the R randomForest package, which serves as an interface to the original random forests (RF) Fortran code developed by Breiman and Cutler (available at <ext-link ext-link-type="uri" xlink:href="http://www.stat.berkeley.edu/∼breiman/RandomForests">http://www.stat.berkeley.edu/∼breiman/RandomForests</ext-link>
). To predict the 8mer profile of an unseen protein <italic>x</italic>
, we generated <italic>t</italic>
 RF, by using the set of <italic>n</italic>
 observations on <italic>m</italic>
 predictors <italic>x</italic>
<sub><italic>i</italic>
</sub>
, the response variable <italic>y</italic>
 for a given 8mer, and default parameters. We then used this collection of RF to predict the <italic>Z</italic>
-scores across all 8mers for the sequence <italic>x</italic>
.</p>
</sec>
<sec id="SEC2.2.3"><title>2.2.3 Support vector regression</title>
<p>We used the LIBSVM package developed by Chih-Chung Chang and Chih-Jen Lin (available at <ext-link ext-link-type="uri" xlink:href="http://www.csie.ntu.edu.tw/∼cjlin/libsvm/">http://www.csie.ntu.edu.tw/∼cjlin/libsvm/</ext-link>) to construct SVMs for every exemplar 8mer. For each 8mer, three SVMs were constructed, each using a different kernel: the linear kernel (SVM_L),
<disp-formula><graphic xlink:href="btn645um1.jpg" position="float"></graphic>
</disp-formula>the polynomial kernel (SVM_P),
<disp-formula><graphic xlink:href="btn645um2.jpg" position="float"></graphic>
</disp-formula>and the radial basis function kernel (SVM_R),
<disp-formula><graphic xlink:href="btn645um3.jpg" position="float"></graphic>
</disp-formula>
where <italic>d</italic>
∈ℕ, γ>0, <italic>x</italic>
 and <italic>x</italic>
′ are alignment encodings, and ⟨<italic>x</italic>
, <italic>x</italic>
′⟩ refers to the inner product. All parameters were left at their default settings, with the following exceptions. For SVM_L, we tried all parameter pairs [ɛ, <italic>C</italic>
]={ɛ, <italic>C</italic>
|0.1≤ɛ≤4.8, 2<sup>−15</sup>
≤<italic>C</italic>
≤2<sup>3</sup>
}, where ɛ is the epsilon-SVM precision parameter, which was varied in steps of 0.8, and <italic>C</italic>
 is the SVM error penalty parameter. For SVM_P, we tried all parameter pairs [<italic>d</italic>
, <italic>C</italic>
]={<italic>d</italic>
, <italic>C</italic>
|1≤<italic>d</italic>
≤6, 2<sup>−15</sup>
≤<italic>C</italic>
≤2<sup>3</sup>
}, where <italic>d</italic>
 was varied in steps of 1. For SVM_R, we tried all parameter pairs [γ, <italic>C</italic>
]={γ, <italic>C</italic>
|2<sup>−15</sup>
≤γ≤2<sup>3</sup>
, 2<sup>−15</sup>
≤<italic>C</italic>
≤2<sup>3</sup>
}, where γ was varied by a factor of 2<sup>2</sup>
. In all cases, <italic>C</italic>
 was varied by a factor of 2<sup>2</sup>
 and the best parameter pair was chosen using 5-fold cross-validation.</p>
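<p>The parameter grids described above can be enumerated as in the sketch below. The exact grid points are an assumption: in particular, ɛ is taken to start at 0.1 and to increase by 0.8 while remaining within the stated upper bound of 4.8. The helper best_pair and its cv_error argument are hypothetical stand-ins for the 5-fold cross-validation error of a model trained with a given pair; no LIBSVM call is shown.</p>

```python
from itertools import product

# Exponents -15, -13, ..., 3 give values from 2^-15 to 2^3 in factors of 2^2.
C_VALUES = [2.0 ** e for e in range(-15, 4, 2)]
GAMMA_VALUES = list(C_VALUES)
DEGREE_VALUES = list(range(1, 7))
# Assumption: epsilon starts at 0.1 and increases by 0.8 while staying within 4.8.
EPSILON_VALUES = [round(0.1 + 0.8 * i, 1) for i in range(6)]

GRIDS = {
    "SVM_L": list(product(EPSILON_VALUES, C_VALUES)),  # (epsilon, C)
    "SVM_P": list(product(DEGREE_VALUES, C_VALUES)),   # (d, C)
    "SVM_R": list(product(GAMMA_VALUES, C_VALUES)),    # (gamma, C)
}


def best_pair(grid, cv_error):
    """Return the parameter pair with the lowest cross-validated error.

    cv_error is a hypothetical callable mapping a pair to its 5-fold
    cross-validation error.
    """
    return min(grid, key=cv_error)
```

<p>With these assumptions the linear and polynomial kernels are each searched over 60 pairs and the radial basis function kernel over 100 pairs per 8mer.</p>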
</sec>
<sec id="SEC2.2.4"><title>2.2.4 Principal components regression</title>
<p>The encoding strategy that we used produces many more variables than samples (rank deficiency) as well as a large number of correlated variables (multicollinearity), both of which are problematic for linear regression. We therefore used principal components regression (PCR) to simultaneously reduce the dimensionality of the encodings and remove the correlation between variables. PCR was carried out by first applying principal components analysis to the encodings. The number of principal components retained, <italic>p</italic>
, was selected using parallel analysis (PA) with 1000 shuffles, which is essentially a permutation test that asks whether the <italic>N</italic>
-th principal component explains more of the variance than the <italic>N</italic>
-th principal component would in a permuted version of the same data [reviewed in Franklin <italic>et al.</italic>
 (<xref ref-type="bibr" rid="B12">1995</xref>)]. On the basis of PA, we retained 6, 12 and 19 principal components for the 6-, 15- and 57-position alignments, respectively. For each 8mer, we then built a regression model using an approach similar to 5-fold cross-validation, described as follows:
<list list-type="order"><list-item><p>randomly partition the sample set into five subsamples;</p>
</list-item>
<list-item><p>retain one subsample as the validation set and aggregate the remaining <italic>k</italic>
=4 subsamples into a matrix of training data, <italic>t</italic>
<sub><italic>ij</italic>
</sub>
(<italic>i</italic>
∈{1,…, <italic>k</italic>
}, <italic>j</italic>
∈{1,…, <italic>p</italic>
}) so that the intercept in the regression model will always be estimated by <inline-formula><inline-graphic xlink:href="btn645i5.jpg"></inline-graphic>
</inline-formula>
 (Montgomery and Runger, <xref ref-type="bibr" rid="B19">2007</xref>
);</p>
</list-item>
<list-item><p>centre and transform the training data into a new set of variables as
<disp-formula><graphic xlink:href="btn645um4.jpg" position="float"></graphic>
</disp-formula>
where <inline-formula><inline-graphic xlink:href="btn645i6.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item><p>compute the ordinary least squares coefficients for the transformed training data and calculate the mean squared error (MSE) of the resulting predictions on the validation set;</p>
</list-item>
<list-item><p>go back to Step 2 until all subsamples have been used as the validation set and retain the coefficients with the lowest MSE;</p>
</list-item>
<list-item><p>repeat Steps 1–5 three times.</p>
</list-item>
</list>
</p>
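<p>The selection loop in Steps 1–6 can be sketched as below. To keep the example self-contained, a one-predictor least-squares fit stands in for ordinary least squares on the <italic>p</italic> retained components, and the centring and transformation of Step 3 are folded into that fit. How coefficients are combined across the three repeats is not stated in the text; this sketch assumes that the coefficients with the lowest validation MSE overall are kept. All function names are hypothetical.</p>

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor (stand-in for OLS)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def mse(model, xs, ys):
    """Mean squared error of the model's predictions on (xs, ys)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(xs, ys, folds=5, repeats=3, seed=0):
    """Keep the coefficients with the lowest validation MSE over all folds and repeats."""
    rng = random.Random(seed)
    best_model, best_err = None, float("inf")
    for _ in range(repeats):                          # Step 6
        idx = list(range(len(xs)))
        rng.shuffle(idx)                              # Step 1: random partition
        parts = [idx[f::folds] for f in range(folds)]
        for held in parts:                            # Steps 2 and 5
            held_set = set(held)
            train = [i for i in idx if i not in held_set]
            model = fit_line([xs[i] for i in train], [ys[i] for i in train])
            err = mse(model, [xs[i] for i in held], [ys[i] for i in held])  # Step 4
            if best_err > err:
                best_model, best_err = model, err
    return best_model, best_err
```

<p>Note that this procedure uses the validation folds to choose among fitted models rather than to estimate generalization error, which is why the text describes it as only similar to 5-fold cross-validation.</p>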
</sec>
</sec>
</sec>
<sec sec-type="results" id="SEC3"><title>3 RESULTS</title>
<sec id="SEC3.1"><title>3.1 Comparison of linear and non-linear inference methods</title>
<p>We attempted to learn the <italic>Z</italic>
-score transformed signal intensities for mouse homeodomain DBDs for all 32 896 non-redundant eight-base DNA sequences using PBM experiments (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
). We learned the <italic>Z</italic>
-scores rather than PWMs because <italic>Z</italic>
-scores reflect binding affinity (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B6">2006</xref>
), whereas PWMs often fail to capture detailed binding activity (Benos <italic>et al.</italic>
, <xref ref-type="bibr" rid="B4">2002</xref>
; Chen <italic>et al.</italic>
, <xref ref-type="bibr" rid="B8">2007</xref>
) and cannot be aligned with confidence for many homeodomains (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
), complicating direct comparisons. To avoid overfitting, we considered a 75-homeodomain subset in which each protein is unique at the 15 amino acid positions described as making contact with DNA in the engrailed crystal structure (Kissinger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B15">1990</xref>
) (<xref ref-type="table" rid="T1">Table 1</xref>
), as we have previously shown that a perfect match at all 15 amino acids yields data comparable to experimental replicates of a single homeodomain (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
). All original datasets and <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btn645/DC1">Supplementary Material</ext-link>
 can be downloaded from <ext-link ext-link-type="uri" xlink:href="http://hugheslab.ccbr.utoronto.ca/supplementary-data/profile_prediction/">http://hugheslab.ccbr.utoronto.ca/supplementary-data/profile_prediction/</ext-link>
.</p>
<p>We assessed the performance of a panel of inference algorithms by a leave-one-out cross-validation approach, in which each of the 75 homeodomains was held out from the training set in turn and the remaining proteins were used as training data to predict the <italic>Z</italic>
-score profile of the held-out protein, given its amino acid sequence. We used regression to create linear models via PCR and linear kernel support vector regression (SVM_L). To create models in which interactions between TF sequence features can be captured, reflecting ‘combinatorial recognition codes’ (Damante <italic>et al.</italic>
, <xref ref-type="bibr" rid="B10">1996</xref>
), we also used support vector regression with a polynomial kernel (SVM_P), or radial basis function kernel (SVM_R), RF (Breiman, <xref ref-type="bibr" rid="B7">2001</xref>
) and a NN approach in which the profile of a held-out protein was predicted as the averaged profiles of its nearest (fewest mismatches) sequence neighbour(s) in the training set. With the exception of the NN method, amino acid sequences of length <italic>l</italic>
 were numerically represented as binary vectors of length <italic>l</italic>
 × 20 digits, i.e. the 20 different amino acids were encoded as orthogonal 20 digit vectors and each protein sequence was represented by concatenating the binary vectors corresponding to residues at each position.</p>
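<p>The leave-one-out scheme can be written generically, as in the sketch below. The predict argument is a hypothetical callable standing for any of the inference methods; it receives the held-out sequence together with the remaining sequences and their measured profiles.</p>

```python
def leave_one_out(seqs, profiles, predict):
    """Predict each protein's profile from a training set that excludes it.

    predict(query_seq, train_seqs, train_profiles) is a hypothetical callable
    returning the predicted profile for query_seq.
    """
    predictions = []
    for i in range(len(seqs)):
        train_seqs = seqs[:i] + seqs[i + 1:]
        train_profiles = profiles[:i] + profiles[i + 1:]
        predictions.append(predict(seqs[i], train_seqs, train_profiles))
    return predictions
```

<p>For the 75 homeodomains considered here, each method is therefore trained 75 times on 74 proteins.</p>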
<p>In each of these analyses we also considered three sets of features: (i) the full 57 amino acid homeodomain (omitting insertions), (ii) the subset of 15 amino acids that contact the DNA in the engrailed structure (positions 3, 5, 6, 25, 31, 44, 46, 47, 48, 50, 51, 53, 54, 55 and 57) (Kissinger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B15">1990</xref>
) and (iii) six amino acids that have been demonstrated to influence binding preferences (positions 3, 6, 7, 47, 50 and 54) (Ekker <italic>et al.</italic>
, <xref ref-type="bibr" rid="B11">1994</xref>
; Laughon, <xref ref-type="bibr" rid="B16">1991</xref>
) (referred to as 57AA, 15AA and 6AA, respectively). We did not consider <italic>de novo</italic>
 feature selection as part of our training process because feature selection consumes statistical (i.e. training) power, and arbitrary feature selection is NP-hard (nondeterministic polynomial-time hard) in the general case (Garey and Johnson, <xref ref-type="bibr" rid="B13">1979</xref>
). In <xref ref-type="sec" rid="SEC3.3">Section 3.3</xref>
, we present evidence that residues scored highest by the RF importance score may be less predictive than literature-derived feature sets.</p>
</sec>
<sec id="SEC3.2"><title>3.2 Assessing the performance of inference methods</title>
<p>The cross-validation results were assessed using three measures: (i) the number of top-100 8mers in common, (ii) Spearman correlation over all 8mers and (iii) overall RMSE (root mean squared error) values between the predicted and the actual <italic>Z</italic>
-score profiles over all 8mers. As a summary statistic, we also counted the number of proteins with a top-100 overlap <50. As a background control, we calculated the difference between the median of each metric and the median of the performance of all predicted versus all actual profiles, since all homeodomain-binding profiles correlate to a degree. Results are tallied in <xref ref-type="table" rid="T2">Table 2</xref>
, which is sorted from best to worst median rank across all of the criteria. Included in <xref ref-type="table" rid="T2">Table 2</xref>
 is the agreement between 19 experimental replicates as a reference for the reproducibility of the assay itself (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>). In these replicates, 19 different homeodomains were each analyzed in duplicate, and the numbers reported refer to 19 pairwise comparisons. Since the set of replicates contains some homeodomains not found in the 75 that we analyzed, however, the performance values cannot be directly compared with those of the predictions.
<table-wrap id="T2" position="float"><label>Table 2.</label>
<caption><p>Leave-one-out cross-validation measures for 8mer <italic>Z</italic>
-score profile prediction algorithms on 32 896 8mers for 75 homeodomains</p>
</caption>
<table frame="hsides" rules="groups"><tbody align="left"><tr><td rowspan="1" colspan="1"><inline-graphic xlink:href="btn645if1.jpg"></inline-graphic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>Algorithms are sorted in descending order of median rank across all columns, where ties are resolved using mean rank. The first row shows the agreement between 19 experimental replicates and their corresponding true <italic>Z</italic>
-score profiles as measured using PBM. Columns labelled ‘predicted versus real’ show the mean or median performance between each predicted profile and its true, measured <italic>Z</italic>
-score profile. Columns labelled ‘control’ show the difference between the median predicted versus real performance and the median of the performance between all pairs of predicted and actual profiles. Cells in a given column are coloured according to their position in the range of that column. Rows labelled top6 and top15 represent the result obtained if we use the 6 and 15 most important amino acid positions according to the RF importance score on the 57AA set.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
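<p>The three measures can be computed for one predicted and one measured profile as in the sketch below. The handling of ties (arbitrary order for the top-100 sets, average ranks for the Spearman correlation) is an assumption, and the function names are hypothetical.</p>

```python
import math


def top_k_indices(scores, k):
    """Indices of the k highest-scoring 8mers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_overlap(pred, real, k=100):
    """Number of 8mers shared by the top-k sets of the two profiles."""
    return len(top_k_indices(pred, k).intersection(top_k_indices(real, k)))


def rmse(pred, real):
    """Root mean squared error between two profiles."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(pred))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = mean_rank
        start = end + 1
    return ranks


def spearman(pred, real):
    """Spearman correlation: Pearson correlation of the ranks (undefined if either profile is constant)."""
    a, b = average_ranks(pred), average_ranks(real)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)
```

<p>The top-100 overlap is sensitive only to the most strongly bound 8mers, whereas the Spearman correlation and RMSE are computed over the whole profile.</p>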
<p>Three major conclusions can be drawn from this analysis. First, results of all algorithms are clearly distinct from random (<xref ref-type="table" rid="T2">Table 2</xref>
, columns 5, 9 and 12). Second, the 15AA and 6AA subsets appear to provide a superior training set relative to the 57AA set. Third, presumably due to the importance of non-linear interactions between amino acid positions in defining DNA-binding specificity, methods that can capture interactions and non-linearities have a clear advantage: there is almost always at least one variant of each non-linear method, i.e. NN, RF and SVM_R, that outperforms every linear method we employed. NN (<xref ref-type="fig" rid="F1">Fig. 1</xref>
, right panel), in particular, has a significantly higher mean top-100 overlap than PCR [95% confidence interval (CI) for difference, 3.92–109; Kruskal–Wallis test]. Moreover, NN often shows the greatest difference from random, and has the fewest predicted profiles with a top-100 overlap<50. In three instances (Evx1, Irx2, and Lhx1), the 15AA NN-predicted <italic>Z</italic>
-score profiles exhibit Spearman correlation, top-100 overlap, or RMSE values that exceed those of the experimental replicates for these proteins. Therefore, it appears that predicted <italic>Z</italic>
-score profiles can, in specific cases, rival experimental replicates in reproducing the <italic>Z</italic>
-score profile of a given homeodomain. <xref ref-type="fig" rid="F2">Figure 2</xref>
 shows scatter plots of the <italic>Z</italic>
-scores for Evx1, Irx2 and Lhx1, as compared with predicted and replicate <italic>Z</italic>-scores.
<fig id="F1" position="float"><label>Fig. 1.</label>
<caption><p>2D clustergram of <italic>Z</italic>
-scores for 2042 8mers and 75 mouse homeodomains, as observed in either real PBM data (left) or NN predictions (right), with some of the established classes of homeodomains labelled. NN predictions were made using 6AA positions and leave-one-out cross-validation. The 2042 8mers were selected because they comprise the top 100 8mers by <italic>Z</italic>
-score over the DBDs shown.</p>
</caption>
<graphic xlink:href="btn645f1"></graphic>
</fig>
<fig id="F2" position="float"><label>Fig. 2.</label>
<caption><p>Comparison of the accuracy of NN predictions versus experimental replicates. Scatterplots show the measured <italic>Z</italic>
-scores for all 32 896 non-redundant eight-base DNA sequences from one PBM versus a second PBM for the same DBD (top) or versus the <italic>Z</italic>
-score predicted using NN (6AA variant; bottom). Median performance metrics are given. Evx1 has a single NN (Hoxa2); Irx2 has a single NN (Irx3); Lhx1 has two NN (Alx3 and Lhx3).</p>
</caption>
<graphic xlink:href="btn645f2"></graphic>
</fig>
</p>
</sec>
<sec id="SEC3.3"><title>3.3 <italic>De novo</italic>
 feature selection</title>
<p>The feature sets we used were chosen on the basis of biochemical and genetic experiments, to ask whether the use of such prior data to select features reduces generalization error. It is also of interest whether automated feature selection identifies the same residues, and whether automatically selected features perform better than those selected using evidence from laboratory studies. Towards this end, we examined the ‘node purity’ importance scores output by RF run with the full 57AA set. We summarized the importance per residue for each round of cross-validation by considering the median importance score for the 2585 8mers reported by Berger <italic>et al.</italic>
 (<xref ref-type="bibr" rid="B5">2008</xref>
) to be bound in at least one experiment using the <italic>E</italic>
>0.45 criterion, reasoning that RF may be learning primarily noise for the remaining 8mers. A very similar set of importance scores emerged from each of the 75 rounds of cross-validation (<xref ref-type="fig" rid="F3">Fig. 3</xref>
). Considering the median importance score over all homeodomains and all 2585 8mers as a feature prioritization measure, we obtained the ranking of residues shown in <xref ref-type="fig" rid="F3">Figure 3</xref>
. The top 15 positions emerging from this analysis are (in descending order) 50, 6, 46, 54, 7, 56, 14, 28, 4, 19, 43, 22, 29, 36, 37. These residues include only four among our 6AA set (6, 7, 50, 54) and four among our 15AA set (6, 46, 50, 54). Thus, <italic>de novo</italic>
 feature selection identifies some, but not all, of the same residues as laboratory studies. We found that the top-6 and top-15 residues selected by the RF importance score did not perform as well in NN (our best performing method) as did the original 6AA and 15AA sets (<xref ref-type="table" rid="T2">Table 2</xref>
). A possible explanation is that <italic>de novo</italic>
 feature selection is identifying residues that correlate with binding specificity, but without being causative; for example, residues that participate in functions of the homeodomains besides DNA binding, those that are shared due to common evolutionary descent, and/or those that co-vary due to structural constraints (Clarke, <xref ref-type="bibr" rid="B9">1995</xref>
). From these results, and the fact that the 6AA and 15AA sets generally provide better features (<xref ref-type="table" rid="T2">Table 2</xref>), we propose that use of experimental evidence in the feature selection step can augment training power, by incorporating external information.
<fig id="F3" position="float"><label>Fig. 3.</label>
<caption><p>Node purity importance scores for 57 homeodomain amino acid positions for 75 rounds of leave-one-out cross-validation, sorted by median value (purple).</p>
</caption>
<graphic xlink:href="btn645f3"></graphic>
</fig>
</p>
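<p>The prioritization step can be sketched as follows, assuming that the importance scores for one round of cross-validation are available as one row per 8mer and one column per homeodomain position. The data layout and function name are hypothetical, and the extraction of scores from the RF models is not shown.</p>

```python
from statistics import median


def rank_positions(importance):
    """Rank positions (1-based) by median importance across 8mers, highest first.

    importance[k][j] is the importance score of position j + 1 in the model
    trained for 8mer k.
    """
    n_pos = len(importance[0])
    med = [median(row[j] for row in importance) for j in range(n_pos)]
    return sorted(range(1, n_pos + 1), key=lambda pos: med[pos - 1], reverse=True)
```

<p>Taking the first 6 or 15 entries of such a ranking gives feature sets analogous to the top6 and top15 rows of <xref ref-type="table" rid="T2">Table 2</xref>.</p>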
</sec>
<sec id="SEC3.4"><title>3.4 Association between prediction difficulty and number of sequence mismatches</title>
<p>In general, the 8mer profiles that are difficult for one algorithm to predict are those that are difficult for other algorithms as well. <xref ref-type="fig" rid="F4">Figure 4</xref> compares the top-100 overlap for all 75 homeodomains for all prediction methods, using the 15AA feature set. The colours of the points reflect the NN distance. There is a clear relationship between the 15AA distance and the top-100 overlap, with the 10 proteins with the greatest distance consistently having overlaps <50, indicating that for all methods the difficulty of learning the 8mer profile for a specific experiment is related to whether there is a similar example in the training set. This trend also holds for other feature sets, and likely explains the success of NN, which does not incorporate any information from more distant profiles.
<fig id="F4" position="float"><label>Fig. 4.</label>
<caption><p>Association between top-100 overlap scores for pairs of 8mer profile inference methods. Scatterplots show the top-100 overlap values for 75 homeodomains when <italic>Z</italic>
-score profiles are predicted using one inference method versus another method for the same proteins. All axes range from 0 to 100. The names on the diagonal label the axes. Predictions are made using the 15 homeodomain DNA-contacting residues. Homeodomains are coloured according to whether they have ≥5 (red), 3–4 (blue) or 1–2 (green) mismatches to their nearest sequence neighbour.</p>
</caption>
<graphic xlink:href="btn645f4"></graphic>
</fig>
</p>
</sec>
</sec>
<sec sec-type="discussion" id="SEC4"><title>4 DISCUSSION</title>
<p>Our results show that the full DNA-binding specificity of uncharacterized TFs to individual <italic>k</italic>
-mers can be predicted on the basis of similarity in protein sequence alone, given the sequence specificity of closely related members of the same TF family, and (preferably) knowledge of the DNA-contacting residues. Our results are likely to underestimate real-world accuracy because we only evaluated homeodomains that are unique at 15 DNA-contacting amino acids. The efficacy of NN makes predicting binding preferences simple to implement and consistent with intuition: it is typically assumed that similarity among functional residues reflects similar protein activity. At least one previous study applied a NN strategy to the inference of PWMs (Qian <italic>et al.</italic>
, <xref ref-type="bibr" rid="B23">2007</xref>
), but our NN implementation is more straightforward and provides relative affinity estimates for individual sequences: in contrast, the approach described by Qian <italic>et al.</italic>
 predicts the consensus motifs of TF binding sites from the TRANSFAC database using the InterPro annotations of the TF of interest and its target genes as training data.</p>
<p>Our results are consistent with the ‘combinatorial code’ model of TF binding (Damante <italic>et al.</italic>
, <xref ref-type="bibr" rid="B10">1996</xref>
; Suzuki and Yagi, <xref ref-type="bibr" rid="B26">1994</xref>
; Suzuki <italic>et al.</italic>
, <xref ref-type="bibr" rid="B25">1995</xref>
). In this model, the relative preference of a TF to individual bases in a given DNA sequence is determined by the aggregate identities of a subset of key amino acid residues. In our regime, this model would translate into interaction terms among amino acid residues. Indeed, in our analysis, methods capable of modelling interactions between amino acid positions, such as NN and RF, appear to be best suited to predicting sequence preferences for TFs, or at least for homeodomains. The fact that linear regression is one of the least effective methods among those tested further supports the importance of interaction terms; preferences to individual DNA sequences apparently cannot be taken as a linear combination of the contributions of each amino acid residue.</p>
<p>In addition, the observation that incorporation of the full set of homeodomain residues adversely affects all success measures that were employed here, even using NN (which would not be subject to overfitting), is consistent with a model in which the remainder of the domain structure primarily plays a role as a scaffold, at least with regard to DNA binding. This is because such a role would provide flexibility in residue identities without impacting DNA sequence specificity.</p>
<p>An important question is whether the outcome of our comparisons would be different with different feature sets, and whether our results could be improved with more sophisticated approaches. With regard to feature sets, even in a circular regime (selecting amino acids using the same data used to test them) we found no feature sets that offered a substantial improvement over the 6AA and 15AA sets (<xref ref-type="fig" rid="F3">Fig. 3</xref>
, <xref ref-type="table" rid="T2">Table 2</xref>
 and data not shown), suggesting that a single DNA–protein co-crystal structure constitutes a powerful feature selection step, perhaps because it provides information that is not available to the algorithms used here. Nonetheless, it is possible that addition of an automated feature selection step might be advantageous, particularly if it is incorporated into the cross-validation regime, i.e. if the feature selection is done separately at each LOO iteration, and/or if it is done in conjunction with feature selection based on experiments. Due to the large number of permutations, we did not explore such variations in this study, nor did we test every possible variation of the techniques represented. For example, it has been reported that pruned decision trees usually perform better than unpruned trees; this was not an option in the RF implementation that we used but would be worth examining. It may also be beneficial in the future to take advantage of similarity among <italic>k</italic>
-mers. In all of the analyses presented here, each <italic>k</italic>
-mer is treated as a separate learning problem; however, there are relationships among the <italic>k</italic>
-mers in both sequence and affinity for individual proteins. Exploration of these variations could shed light on the biology of DNA binding in addition to improving prediction results. We note, however, that there are also benefits associated with use of simple inference methods, such as NN. While performing as well as other methods, NN is computationally much less intensive than any other method we tested; in our algorithm, NN is determined based on protein sequence alone, so the time complexity of this part of the algorithm does not depend upon the number of <italic>k</italic>
-mers. Also, the success of NN on the full set of <italic>k</italic>
-mer affinities suggests that our NN approach would also work well when the binding preferences of each TF were represented differently, e.g. as a PWM.</p>
<p>Another question is whether better results could be obtained using a training set that more completely samples possible homeodomain amino acid combinations. With regard to sampling depth in the training set, the argument may be academic: the large number of possible combinations would be impractical to survey in the laboratory, and also appears to be sparsely populated in nature (data not shown) (Berger <italic>et al.</italic>
, <xref ref-type="bibr" rid="B5">2008</xref>
). A more extensive PBM-based survey of the binding preferences of naturally occurring unique combinations among DNA-contacting residues might be the next step towards both theoretical and practical aims. Such a survey would also help to clarify the functional evolution of the distinct homeodomain subclasses. One interpretation of the success of NN—coupled with the fact that all algorithms suffer considerably when there is no similar homeodomain in the training set to serve as an example—is that homeodomain groups [described in Banerjee-Basu and Baxevanis (<xref ref-type="bibr" rid="B2">2001</xref>
), although the groups we obtained are not always identical] each have distinct DNA-binding modes that cannot be inferred from examples in other groups. Consistent with this notion, there is a strong correspondence between 8mer binding profiles and sequence groups obtained by ClustalW (data not shown). In fact, we cannot rule out that RF and/or SVM_R are acting in essence as a more sophisticated version of NN, by learning group memberships. We have attempted to improve upon our current results using unsupervised sequence clustering approaches, but have not yet been able to improve upon the NN results (data not shown). One explanation for this outcome may be that there are no ideal natural subdivisions within these groups; instead, there is variation on a theme within each group, and the variation in amino acid sequence bears a relationship to that seen in the 8mer binding profiles. If this is the case, then even better inference results might be obtained from a two-stage process in which group assignment is separated from <italic>k</italic>
-mer profile prediction within a group.</p>
<p>Finally, our preliminary results (data not shown) suggest that NN will be similarly applicable to other DBD classes. Extension of the work presented here should allow future experimental studies of binding specificity to focus on proteins most likely to possess new DNA-binding activities, and will facilitate more accurate inference of DNA-binding data among proteins with related sequences.</p>
</sec>
</body>
<back><ack><title>ACKNOWLEDGEMENTS</title>
<p>We are grateful to Harm van Bakel and Jeff Liu for maintenance of computational infrastructure. We thank Gary Bader and Alan Davidson for helpful discussions.</p>
<p><italic>Funding</italic>
: T.M.A. was supported by an Ontario Graduate Scholarship and funding from Howard Hughes Medical Institute (HHMI) to T.R.H. Generation of the experimental data analyzed was supported by grants to T.R.H. and M.L.B. from CIHR, Genome Canada through the Ontario Genomics Institute, the Ontario Research Fund and National Institutes of Health (NIH)/National Human Genome Research Institute (NHGRI).</p>
<p><italic>Conflict of Interest</italic>
: none declared.</p>
</ack>
<ref-list><title>REFERENCES</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ades</surname>
<given-names>SE</given-names>
</name>
<name><surname>Sauer</surname>
<given-names>RT</given-names>
</name>
</person-group>
<article-title>Differential DNA-binding specificity of the engrailed homeodomain: the role of residue 50</article-title>
<source>Biochemistry</source>
<year>1994</year>
<volume>33</volume>
<fpage>9187</fpage>
<lpage>9194</lpage>
<pub-id pub-id-type="pmid">8049221</pub-id>
</citation>
</ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Banerjee-Basu</surname>
<given-names>S</given-names>
</name>
<name><surname>Baxevanis</surname>
<given-names>AD</given-names>
</name>
</person-group>
<article-title>Molecular evolution of the homeodomain family of transcription factors</article-title>
<source>Nucleic Acids Res.</source>
<year>2001</year>
<volume>29</volume>
<fpage>3258</fpage>
<lpage>3269</lpage>
<pub-id pub-id-type="pmid">11470884</pub-id>
</citation>
</ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Pfam protein families database</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>D138</fpage>
<lpage>D141</lpage>
<pub-id pub-id-type="pmid">14681378</pub-id>
</citation>
</ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benos</surname>
<given-names>PV</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Additivity in protein-DNA interactions: how good an approximation is it?</article-title>
<source>Nucleic Acids Res.</source>
<year>2002</year>
<volume>30</volume>
<fpage>4442</fpage>
<lpage>4451</lpage>
<pub-id pub-id-type="pmid">12384591</pub-id>
</citation>
</ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berger</surname>
<given-names>MF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences</article-title>
<source>Cell</source>
<year>2008</year>
<volume>133</volume>
<fpage>1266</fpage>
<lpage>1276</lpage>
<pub-id pub-id-type="pmid">18585359</pub-id>
</citation>
</ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berger</surname>
<given-names>MF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities</article-title>
<source>Nat. Biotechnol.</source>
<year>2006</year>
<volume>24</volume>
<fpage>1429</fpage>
<lpage>1435</lpage>
<pub-id pub-id-type="pmid">16998473</pub-id>
</citation>
</ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Random forests</article-title>
<source>Mach. Learn.</source>
<year>2001</year>
<volume>45</volume>
<fpage>5</fpage>
<lpage>32</lpage>
</citation>
</ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>i72</fpage>
<lpage>i79</lpage>
<pub-id pub-id-type="pmid">17646348</pub-id>
</citation>
</ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clarke</surname>
<given-names>ND</given-names>
</name>
</person-group>
<article-title>Covariation of residues in the homeodomain sequence family</article-title>
<source>Protein Sci.</source>
<year>1995</year>
<volume>4</volume>
<fpage>2269</fpage>
<lpage>2278</lpage>
<pub-id pub-id-type="pmid">8563623</pub-id>
</citation>
</ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Damante</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A molecular code dictates sequence-specific DNA recognition by homeodomains</article-title>
<source>EMBO J.</source>
<year>1996</year>
<volume>15</volume>
<fpage>4992</fpage>
<lpage>5000</lpage>
<pub-id pub-id-type="pmid">8890172</pub-id>
</citation>
</ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ekker</surname>
<given-names>SC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The degree of variation in DNA sequence recognition among four Drosophila homeotic proteins</article-title>
<source>EMBO J.</source>
<year>1994</year>
<volume>13</volume>
<fpage>3551</fpage>
<lpage>3560</lpage>
<pub-id pub-id-type="pmid">7914870</pub-id>
</citation>
</ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franklin</surname>
<given-names>SB</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Parallel analysis: a method for determining significant principal components</article-title>
<source>J. Veg. Sci.</source>
<year>1995</year>
<volume>6</volume>
<fpage>99</fpage>
<lpage>106</lpage>
</citation>
</ref>
<ref id="B13"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Garey</surname>
<given-names>MR</given-names>
</name>
<name><surname>Johnson</surname>
<given-names>DS</given-names>
</name>
</person-group>
<source>Computers and Intractability: A Guide to the Theory of NP-Completeness.</source>
<year>1979</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>W. H. Freeman</publisher-name>
</citation>
</ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hanes</surname>
<given-names>SD</given-names>
</name>
<name><surname>Brent</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>DNA specificity of the bicoid activator protein is determined by homeodomain recognition helix residue 9</article-title>
<source>Cell</source>
<year>1989</year>
<volume>57</volume>
<fpage>1275</fpage>
<lpage>1283</lpage>
<pub-id pub-id-type="pmid">2500253</pub-id>
</citation>
</ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kissinger</surname>
<given-names>CR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Crystal structure of an engrailed homeodomain-DNA complex at 2.8 Å resolution: a framework for understanding homeodomain-DNA interactions</article-title>
<source>Cell</source>
<year>1990</year>
<volume>63</volume>
<fpage>579</fpage>
<lpage>590</lpage>
<pub-id pub-id-type="pmid">1977522</pub-id>
</citation>
</ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laughon</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>DNA binding specificity of homeodomains</article-title>
<source>Biochemistry</source>
<year>1991</year>
<volume>30</volume>
<fpage>11357</fpage>
<lpage>11367</lpage>
<pub-id pub-id-type="pmid">1742275</pub-id>
</citation>
</ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>DIP-chip: rapid and accurate determination of DNA-binding specificity</article-title>
<source>Genome Res.</source>
<year>2005</year>
<volume>15</volume>
<fpage>421</fpage>
<lpage>427</lpage>
<pub-id pub-id-type="pmid">15710749</pub-id>
</citation>
</ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miller</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Structural basis for DNA recognition by the basic region leucine zipper transcription factor CCAAT/enhancer-binding protein alpha</article-title>
<source>J. Biol. Chem.</source>
<year>2003</year>
<volume>278</volume>
<fpage>15178</fpage>
<lpage>15184</lpage>
<pub-id pub-id-type="pmid">12578822</pub-id>
</citation>
</ref>
<ref id="B19"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Montgomery</surname>
<given-names>DC</given-names>
</name>
<name><surname>Runger</surname>
<given-names>GC</given-names>
</name>
</person-group>
<source>Applied Statistics and Probability for Engineers.</source>
<year>2007</year>
<publisher-loc>Hoboken, NJ</publisher-loc>
<publisher-name>Wiley</publisher-name>
</citation>
</ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mukherjee</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays</article-title>
<source>Nat. Genet.</source>
<year>2004</year>
<volume>36</volume>
<fpage>1331</fpage>
<lpage>1339</lpage>
<pub-id pub-id-type="pmid">15543148</pub-id>
</citation>
</ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pabo</surname>
<given-names>CO</given-names>
</name>
<name><surname>Nekludova</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition?</article-title>
<source>J. Mol. Biol.</source>
<year>2000</year>
<volume>301</volume>
<fpage>597</fpage>
<lpage>624</lpage>
<pub-id pub-id-type="pmid">10966773</pub-id>
</citation>
</ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Papavassiliou</surname>
<given-names>AG</given-names>
</name>
</person-group>
<article-title>Transcription factors: structure, function, and implication in malignant growth</article-title>
<source>Anticancer Res.</source>
<year>1995</year>
<volume>15</volume>
<fpage>891</fpage>
<lpage>894</lpage>
<pub-id pub-id-type="pmid">7645977</pub-id>
</citation>
</ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qian</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
</person-group>
<article-title>An approach to predict transcription factor DNA binding site specificity based upon gene and transcription factor functional categorization</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2449</fpage>
<lpage>2454</lpage>
<pub-id pub-id-type="pmid">17623704</pub-id>
</citation>
</ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>DNA binding sites: representation and discovery</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>16</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="pmid">10812473</pub-id>
</citation>
</ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suzuki</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>DNA recognition code of transcription factors</article-title>
<source>Protein Eng.</source>
<year>1995</year>
<volume>8</volume>
<fpage>319</fpage>
<lpage>328</lpage>
<pub-id pub-id-type="pmid">7567917</pub-id>
</citation>
</ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suzuki</surname>
<given-names>M</given-names>
</name>
<name><surname>Yagi</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>DNA recognition code of transcription factors in the helix-turn-helix, probe helix, hormone receptor, and zinc finger families</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1994</year>
<volume>91</volume>
<fpage>12357</fpage>
<lpage>12361</lpage>
<pub-id pub-id-type="pmid">7809040</pub-id>
</citation>
</ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Warren</surname>
<given-names>CL</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Defining the sequence-recognition profile of DNA-binding molecules</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2006</year>
<volume>103</volume>
<fpage>867</fpage>
<lpage>872</lpage>
<pub-id pub-id-type="pmid">16418267</pub-id>
</citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000938 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000938 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2666811
   |texte=   Predicting the binding preference of transcription factors to individual DNA k-mers
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:19088121" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
