MERS Exploration Server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development

Internal identifier: 000F91 (Pmc/Corpus); previous: 000F90; next: 000F92

Authors: Xuejing Li; Casandra Panea; Chris H. Wiggins; Valerie Reinke; Christina Leslie

Source :

RBID : PMC:2861633

Abstract

A key problem in understanding transcriptional regulatory networks is deciphering what cis regulatory logic is encoded in gene promoter sequences and how this sequence information maps to expression. A typical computational approach to this problem involves clustering genes by their expression profiles and then searching for overrepresented motifs in the promoter sequences of genes in a cluster. However, genes with similar expression profiles may be controlled by distinct regulatory programs. Moreover, if many gene expression profiles in a data set are highly correlated, as in the case of whole organism developmental time series, it may be difficult to resolve fine-grained clusters in the first place. We present a predictive framework for modeling the natural flow of information, from promoter sequence to expression, to learn cis regulatory motifs and characterize gene expression patterns in developmental time courses. We introduce a cluster-free algorithm based on a graph-regularized version of partial least squares (PLS) regression to learn sequence patterns—represented by graphs of k-mers, or “graph-mers”—that predict gene expression trajectories. Applying the approach to wildtype germline development in Caenorhabditis elegans, we found that the first and second latent PLS factors mapped to expression profiles for oocyte and sperm genes, respectively. We extracted both known and novel motifs from the graph-mers associated to these germline-specific patterns, including novel CG-rich motifs specific to oocyte genes. We found evidence supporting the functional relevance of these putative regulatory elements through analysis of positional bias, motif conservation and in situ gene expression. This study demonstrates that our regression model can learn biologically meaningful latent structure and identify potentially functional motifs from subtle developmental time course expression data.


Url:
DOI: 10.1371/journal.pcbi.1000761
PubMed: 20454681
PubMed Central: 2861633

Links to Exploration step

PMC:2861633

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development</title>
<author>
<name sortKey="Li, Xuejing" sort="Li, Xuejing" uniqKey="Li X" first="Xuejing" last="Li">Xuejing Li</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Physics, Columbia University, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Panea, Casandra" sort="Panea, Casandra" uniqKey="Panea C" first="Casandra" last="Panea">Casandra Panea</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Genetics, Yale University, New Haven, Connecticut, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wiggins, Chris H" sort="Wiggins, Chris H" uniqKey="Wiggins C" first="Chris H." last="Wiggins">Chris H. Wiggins</name>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Reinke, Valerie" sort="Reinke, Valerie" uniqKey="Reinke V" first="Valerie" last="Reinke">Valerie Reinke</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Genetics, Yale University, New Haven, Connecticut, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leslie, Christina" sort="Leslie, Christina" uniqKey="Leslie C" first="Christina" last="Leslie">Christina Leslie</name>
<affiliation>
<nlm:aff id="aff4">
<addr-line>Computational Biology Program, Sloan-Kettering Institute, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">20454681</idno>
<idno type="pmc">2861633</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2861633</idno>
<idno type="RBID">PMC:2861633</idno>
<idno type="doi">10.1371/journal.pcbi.1000761</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">000F91</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F91</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development</title>
<author>
<name sortKey="Li, Xuejing" sort="Li, Xuejing" uniqKey="Li X" first="Xuejing" last="Li">Xuejing Li</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>Department of Physics, Columbia University, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Panea, Casandra" sort="Panea, Casandra" uniqKey="Panea C" first="Casandra" last="Panea">Casandra Panea</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Genetics, Yale University, New Haven, Connecticut, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wiggins, Chris H" sort="Wiggins, Chris H" uniqKey="Wiggins C" first="Chris H." last="Wiggins">Chris H. Wiggins</name>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Reinke, Valerie" sort="Reinke, Valerie" uniqKey="Reinke V" first="Valerie" last="Reinke">Valerie Reinke</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>Department of Genetics, Yale University, New Haven, Connecticut, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Leslie, Christina" sort="Leslie, Christina" uniqKey="Leslie C" first="Christina" last="Leslie">Christina Leslie</name>
<affiliation>
<nlm:aff id="aff4">
<addr-line>Computational Biology Program, Sloan-Kettering Institute, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>A key problem in understanding transcriptional regulatory networks is deciphering what
<italic>cis</italic>
regulatory logic is encoded in gene promoter sequences and how this sequence information maps to expression. A typical computational approach to this problem involves clustering genes by their expression profiles and then searching for overrepresented motifs in the promoter sequences of genes in a cluster. However, genes with similar expression profiles may be controlled by distinct regulatory programs. Moreover, if many gene expression profiles in a data set are highly correlated, as in the case of whole organism developmental time series, it may be difficult to resolve fine-grained clusters in the first place. We present a predictive framework for modeling the natural flow of information, from promoter sequence to expression, to learn
<italic>cis</italic>
regulatory motifs and characterize gene expression patterns in developmental time courses. We introduce a cluster-free algorithm based on a graph-regularized version of partial least squares (PLS) regression to learn sequence patterns—represented by graphs of
<italic>k</italic>
-mers, or “graph-mers”—that predict gene expression trajectories. Applying the approach to wildtype germline development in
<italic>Caenorhabditis elegans</italic>
, we found that the first and second latent PLS factors mapped to expression profiles for oocyte and sperm genes, respectively. We extracted both known and novel motifs from the graph-mers associated to these germline-specific patterns, including novel CG-rich motifs specific to oocyte genes. We found evidence supporting the functional relevance of these putative regulatory elements through analysis of positional bias, motif conservation and
<italic>in situ</italic>
gene expression. This study demonstrates that our regression model can learn biologically meaningful latent structure and identify potentially functional motifs from subtle developmental time course expression data.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
<author>
<name sortKey="Li, N" uniqKey="Li N">N Li</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author>
<name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
<author>
<name sortKey="De Moor, Bd" uniqKey="De Moor B">BD De Moor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Belkin, M" uniqKey="Belkin M">M Belkin</name>
</author>
<author>
<name sortKey="Niyogi, P" uniqKey="Niyogi P">P Niyogi</name>
</author>
<author>
<name sortKey="Sindhwani, V" uniqKey="Sindhwani V">V Sindhwani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ng, Ay" uniqKey="Ng A">AY Ng</name>
</author>
<author>
<name sortKey="Jordan, Mi" uniqKey="Jordan M">MI Jordan</name>
</author>
<author>
<name sortKey="Weiss, Y" uniqKey="Weiss Y">Y Weiss</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rapaport, F" uniqKey="Rapaport F">F Rapaport</name>
</author>
<author>
<name sortKey="Zinovyev, A" uniqKey="Zinovyev A">A Zinovyev</name>
</author>
<author>
<name sortKey="Dutreix, M" uniqKey="Dutreix M">M Dutreix</name>
</author>
<author>
<name sortKey="Barillot, E" uniqKey="Barillot E">E Barillot</name>
</author>
<author>
<name sortKey="Vert, Jp" uniqKey="Vert J">JP Vert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author>
<name sortKey="Elkan, C" uniqKey="Elkan C">C Elkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reinke, V" uniqKey="Reinke V">V Reinke</name>
</author>
<author>
<name sortKey="Gil, Is" uniqKey="Gil I">IS Gil</name>
</author>
<author>
<name sortKey="Ward, S" uniqKey="Ward S">S Ward</name>
</author>
<author>
<name sortKey="Kazmer, K" uniqKey="Kazmer K">K Kazmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boulesteix, Al" uniqKey="Boulesteix A">AL Boulesteix</name>
</author>
<author>
<name sortKey="Strimmer, K" uniqKey="Strimmer K">K Strimmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bader, Gd" uniqKey="Bader G">GD Bader</name>
</author>
<author>
<name sortKey="Hogue, Cwv" uniqKey="Hogue C">CWV Hogue</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shim, Y" uniqKey="Shim Y">Y Shim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="J, C" uniqKey="J C">C J</name>
</author>
<author>
<name sortKey="K, Sm" uniqKey="K S">SM K</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Waterston, R" uniqKey="Waterston R">R Waterston</name>
</author>
<author>
<name sortKey="Lindblad Toh, K" uniqKey="Lindblad Toh K">K Lindblad-Toh</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
<author>
<name sortKey="Rogers, J" uniqKey="Rogers J">J Rogers</name>
</author>
<author>
<name sortKey="Abril, J" uniqKey="Abril J">J Abril</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raychaudhuri, S" uniqKey="Raychaudhuri S">S Raychaudhuri</name>
</author>
<author>
<name sortKey="Stuart, J" uniqKey="Stuart J">J Stuart</name>
</author>
<author>
<name sortKey="Altman, R" uniqKey="Altman R">R Altman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beer, Ma" uniqKey="Beer M">MA Beer</name>
</author>
<author>
<name sortKey="Tavazoie, S" uniqKey="Tavazoie S">S Tavazoie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ernst, J" uniqKey="Ernst J">J Ernst</name>
</author>
<author>
<name sortKey="Vainas, O" uniqKey="Vainas O">O Vainas</name>
</author>
<author>
<name sortKey="Harbison, Ct" uniqKey="Harbison C">CT Harbison</name>
</author>
<author>
<name sortKey="Simon, I" uniqKey="Simon I">I Simon</name>
</author>
<author>
<name sortKey="Bar Joseph, Z" uniqKey="Bar Joseph Z">Z Bar-Joseph</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Shapira, M" uniqKey="Shapira M">M Shapira</name>
</author>
<author>
<name sortKey="Regev, A" uniqKey="Regev A">A Regev</name>
</author>
<author>
<name sortKey="Pe Er, D" uniqKey="Pe Er D">D Pe'er</name>
</author>
<author>
<name sortKey="Botstein, D" uniqKey="Botstein D">D Botstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Middendorf, M" uniqKey="Middendorf M">M Middendorf</name>
</author>
<author>
<name sortKey="Kundaje, A" uniqKey="Kundaje A">A Kundaje</name>
</author>
<author>
<name sortKey="Shah, M" uniqKey="Shah M">M Shah</name>
</author>
<author>
<name sortKey="Freund, Y" uniqKey="Freund Y">Y Freund</name>
</author>
<author>
<name sortKey="Wiggins, C" uniqKey="Wiggins C">C Wiggins</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kundaje, A" uniqKey="Kundaje A">A Kundaje</name>
</author>
<author>
<name sortKey="Xin, X" uniqKey="Xin X">X Xin</name>
</author>
<author>
<name sortKey="Lan, C" uniqKey="Lan C">C Lan</name>
</author>
<author>
<name sortKey="Lianoglou, S" uniqKey="Lianoglou S">S Lianoglou</name>
</author>
<author>
<name sortKey="Zhou, M" uniqKey="Zhou M">M Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bussemaker, Hj" uniqKey="Bussemaker H">HJ Bussemaker</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Siggia, Ed" uniqKey="Siggia E">ED Siggia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Nr" uniqKey="Zhang N">NR Zhang</name>
</author>
<author>
<name sortKey="Wildermuth, Mc" uniqKey="Wildermuth M">MC Wildermuth</name>
</author>
<author>
<name sortKey="Speed, Tp" uniqKey="Speed T">TP Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bonneau, R" uniqKey="Bonneau R">R Bonneau</name>
</author>
<author>
<name sortKey="Reiss, D" uniqKey="Reiss D">D Reiss</name>
</author>
<author>
<name sortKey="Shannon, P" uniqKey="Shannon P">P Shannon</name>
</author>
<author>
<name sortKey="Facciotti, M" uniqKey="Facciotti M">M Facciotti</name>
</author>
<author>
<name sortKey="Hood, L" uniqKey="Hood L">L Hood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brilli, M" uniqKey="Brilli M">M Brilli</name>
</author>
<author>
<name sortKey="Fani, R" uniqKey="Fani R">R Fani</name>
</author>
<author>
<name sortKey="Li, P" uniqKey="Li P">P Lió</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Naughton, Bt" uniqKey="Naughton B">BT Naughton</name>
</author>
<author>
<name sortKey="Fratkin, E" uniqKey="Fratkin E">E Fratkin</name>
</author>
<author>
<name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Das, D" uniqKey="Das D">D Das</name>
</author>
<author>
<name sortKey="Pellegrini, M" uniqKey="Pellegrini M">M Pellegrini</name>
</author>
<author>
<name sortKey="Gray, Jw" uniqKey="Gray J">JW Gray</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segal, E" uniqKey="Segal E">E Segal</name>
</author>
<author>
<name sortKey="Raveh Sadka, T" uniqKey="Raveh Sadka T">T Raveh-Sadka</name>
</author>
<author>
<name sortKey="Schroeder, M" uniqKey="Schroeder M">M Schroeder</name>
</author>
<author>
<name sortKey="Unnerstall, U" uniqKey="Unnerstall U">U Unnerstall</name>
</author>
<author>
<name sortKey="Gaul, U" uniqKey="Gaul U">U Gaul</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, L" uniqKey="Wang L">L Wang</name>
</author>
<author>
<name sortKey="Chen, G" uniqKey="Chen G">G Chen</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hill, A" uniqKey="Hill A">A Hill</name>
</author>
<author>
<name sortKey="Hunter, C" uniqKey="Hunter C">C Hunter</name>
</author>
<author>
<name sortKey="Tsung, B" uniqKey="Tsung B">B Tsung</name>
</author>
<author>
<name sortKey="Tucker Kellogg, G" uniqKey="Tucker Kellogg G">G Tucker-Kellogg</name>
</author>
<author>
<name sortKey="Brown, E" uniqKey="Brown E">E Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jong, S" uniqKey="Jong S">S Jong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weinberger, Kq" uniqKey="Weinberger K">KQ Weinberger</name>
</author>
<author>
<name sortKey="Sha, F" uniqKey="Sha F">F Sha</name>
</author>
<author>
<name sortKey="Zhu, Q" uniqKey="Zhu Q">Q Zhu</name>
</author>
<author>
<name sortKey="Saul, Lk" uniqKey="Saul L">LK Saul</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chung, Frk" uniqKey="Chung F">FRK Chung</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Gelfand, M" uniqKey="Gelfand M">M Gelfand</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shannon, P" uniqKey="Shannon P">P Shannon</name>
</author>
<author>
<name sortKey="Markiel, A" uniqKey="Markiel A">A Markiel</name>
</author>
<author>
<name sortKey="Ozier, O" uniqKey="Ozier O">O Ozier</name>
</author>
<author>
<name sortKey="Baliga, Ns" uniqKey="Baliga N">NS Baliga</name>
</author>
<author>
<name sortKey="Wang, Jt" uniqKey="Wang J">JT Wang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group>
<journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">20454681</article-id>
<article-id pub-id-type="pmc">2861633</article-id>
<article-id pub-id-type="publisher-id">09-PLCB-RA-0689R3</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.1000761</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline">
<subject>Genetics and Genomics/Bioinformatics</subject>
<subject>Molecular Biology/Bioinformatics</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development</article-title>
<alt-title alt-title-type="running-head">Learning Motifs that Predict Gene Expression</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Xuejing</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Panea</surname>
<given-names>Casandra</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wiggins</surname>
<given-names>Chris H.</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Reinke</surname>
<given-names>Valerie</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Leslie</surname>
<given-names>Christina</given-names>
</name>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Department of Physics, Columbia University, New York, New York, United States of America</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Genetics, Yale University, New Haven, Connecticut, United States of America</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, United States of America</addr-line>
</aff>
<aff id="aff4">
<label>4</label>
<addr-line>Computational Biology Program, Sloan-Kettering Institute, New York, New York, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Regev</surname>
<given-names>Aviv</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">Broad Institute of MIT and Harvard, United States of America</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>cleslie@cbio.mskcc.org</email>
</corresp>
<fn fn-type="con">
<p>Conceived and designed the experiments: XL CP CHW VR CL. Performed the experiments: XL CP. Analyzed the data: XL CP CHW VR CL. Contributed reagents/materials/analysis tools: XL CP CHW VR CL. Wrote the paper: XL VR CL.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<month>4</month>
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>29</day>
<month>4</month>
<year>2010</year>
</pub-date>
<volume>6</volume>
<issue>4</issue>
<elocation-id>e1000761</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>6</month>
<year>2009</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>3</month>
<year>2010</year>
</date>
</history>
<permissions>
<copyright-statement>Li et al.</copyright-statement>
<copyright-year>2010</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract>
<p>A key problem in understanding transcriptional regulatory networks is deciphering what
<italic>cis</italic>
regulatory logic is encoded in gene promoter sequences and how this sequence information maps to expression. A typical computational approach to this problem involves clustering genes by their expression profiles and then searching for overrepresented motifs in the promoter sequences of genes in a cluster. However, genes with similar expression profiles may be controlled by distinct regulatory programs. Moreover, if many gene expression profiles in a data set are highly correlated, as in the case of whole organism developmental time series, it may be difficult to resolve fine-grained clusters in the first place. We present a predictive framework for modeling the natural flow of information, from promoter sequence to expression, to learn
<italic>cis</italic>
regulatory motifs and characterize gene expression patterns in developmental time courses. We introduce a cluster-free algorithm based on a graph-regularized version of partial least squares (PLS) regression to learn sequence patterns—represented by graphs of
<italic>k</italic>
-mers, or “graph-mers”—that predict gene expression trajectories. Applying the approach to wildtype germline development in
<italic>Caenorhabditis elegans</italic>
, we found that the first and second latent PLS factors mapped to expression profiles for oocyte and sperm genes, respectively. We extracted both known and novel motifs from the graph-mers associated to these germline-specific patterns, including novel CG-rich motifs specific to oocyte genes. We found evidence supporting the functional relevance of these putative regulatory elements through analysis of positional bias, motif conservation and
<italic>in situ</italic>
gene expression. This study demonstrates that our regression model can learn biologically meaningful latent structure and identify potentially functional motifs from subtle developmental time course expression data.</p>
</abstract>
<abstract abstract-type="summary">
<title>Author Summary</title>
<p>A major challenge in functional genomics is to decipher the gene regulatory networks operating in multi-cellular organisms, such as the nematode
<italic>C. elegans</italic>
. The expression level of a gene is controlled, to a great extent, by regulatory proteins called transcription factors that bind short motifs in the gene's promoter (regulatory region in the non-coding DNA). In a temporal regulatory process, for example in development, the “regulatory logic” of DNA motifs in the promoter largely determines the gene's expression trajectory, as the gene responds over time to changing concentrations of the transcription factors that control it. This study addresses the problem of learning DNA motifs that predict temporal expression profiles, using genomewide expression data from developmental time series in
<italic>C. elegans</italic>
. We developed a novel algorithm based on techniques from multivariate regression that sets up a correspondence between sequence patterns and expression trajectories. Sequence motifs are represented as graphs of sequence-similar
<italic>k</italic>
-length subsequences called “graph-mers”. By applying the method to germline development in
<italic>C. elegans</italic>
, we found both known and novel DNA motifs associated with oocyte and sperm genes.</p>
</abstract>
<counts>
<page-count count="13"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>The mRNA expression level of a gene is regulated by multiple input signals that are integrated by the
<italic>cis</italic>
regulatory logic encoded in the gene's promoter. Genes whose regulatory sequences contain similar DNA motifs are likely to have correlated expression profiles across a given set of experimental conditions. The converse, however, is not necessarily true. That is, genes can have correlated expression profiles without being coregulated, since multiple regulatory programs may lead to similar patterns of differential expression. This is particularly evident in developmental time series data, in which the genes exhibit only a few distinct expression patterns. Nevertheless, computational approaches for deciphering gene regulatory networks from gene expression and promoter sequence data often do assume that correlation implies coregulation. For example, a typical computational strategy is to cluster genes by their expression profiles and then apply motif discovery algorithms to the promoter sequences for each cluster. The cluster-first motif discovery approach is indeed so prevalent that the best-known benchmarking study of motif discovery algorithms
<xref rid="pcbi.1000761-Tompa1" ref-type="bibr">[1]</xref>
defines the problem in precisely this way – namely, given a cluster of genes, find the overrepresented motif(s) in the promoter sequences – and compares numerous such algorithms. It is clear, however, that assigning genes to static clusters that are assumed to be coregulated oversimplifies the biology of transcriptional regulation. Moreover, in a setting where there are few experiments probing the conditions of interest or where many genes have synchronized expression profiles, such as in a time course, clustering may fail to resolve meaningful gene sets for subsequent motif analysis.</p>
<p>In the current work, we present an algorithm that models the natural flow of information, from sequence to expression, to learn
<italic>cis</italic>
regulatory motifs and to characterize gene expression patterns. Our algorithm learns motifs that help to predict the full expression profiles of genes over a set of experiments, with no clustering. More precisely, we use a novel algorithm based on partial least squares (PLS) regression to learn a mapping from the set of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e001.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in a promoter to the expression profile of the gene across experiments; in time series, we learn
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e002.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers that help to predict the full expression time course for genes. PLS combines dimensionality reduction and regression; it iteratively finds latent factors in the input space with maximal covariance with projections in the output space. We introduce a graph-regularized version of the PLS algorithm to enable motif discovery by imposing two constraints: a lasso
<xref rid="pcbi.1000761-Tibshirani1" ref-type="bibr">[2]</xref>
constraint for sparsity and a graph Laplacian constraint for smoothness over sequence-similar motifs. Our novel graph-regularized PLS algorithm can be used in any situation where the input features are related by a graph structure. Here, the graph structure is defined on the feature space of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e003.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, with edges connecting pairs of similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e004.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. Our approach is motivated by recent machine learning work that uses the graph Laplacian to exploit graph structure in various ways, for example, by defining a graph over training examples in semi-supervised classification (Laplacian SVM
<xref rid="pcbi.1000761-Belkin1" ref-type="bibr">[3]</xref>
) and clustering (spectral clustering
<xref rid="pcbi.1000761-Ng1" ref-type="bibr">[4]</xref>
) as well as imposing graph smoothness on features of an SVM classifier
<xref rid="pcbi.1000761-Rapaport1" ref-type="bibr">[5]</xref>
.</p>
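<p>The graph Laplacian smoothness constraint described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of 3-mers, the Hamming-distance threshold of 1, the unnormalized Laplacian and the toy weight vector are all assumptions made for illustration. The sketch builds a graph over all k-mers, forms the Laplacian L = D - A, and checks that the quadratic penalty on a weight vector w equals the sum of squared weight differences across edges, which is why the penalty favors similar weights on sequence-similar k-mers.</p>
<preformat><![CDATA[
```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_graph(k, max_dist=1, alphabet="ACGT"):
    """Return all k-mers and the index pairs within max_dist (assumed threshold)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    edges = [
        (i, j)
        for i in range(len(kmers))
        for j in range(i + 1, len(kmers))
        if hamming(kmers[i], kmers[j]) <= max_dist
    ]
    return kmers, edges


def laplacian(n, edges):
    """Unnormalized graph Laplacian L = D - A as a dense list of lists."""
    lap = [[0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][j] -= 1
        lap[j][i] -= 1
        lap[i][i] += 1
        lap[j][j] += 1
    return lap


def smoothness(w, lap):
    """Quadratic form w^T L w; equals the sum over edges of (w_i - w_j)^2."""
    n = len(w)
    return sum(w[i] * lap[i][j] * w[j] for i in range(n) for j in range(n))


# Toy example: 3-mers, with a hypothetical weight of 1 on k-mers containing "CG".
kmers, edges = kmer_graph(k=3)
L = laplacian(len(kmers), edges)
w = [1.0 if "CG" in s else 0.0 for s in kmers]
penalty = smoothness(w, L)
edge_sum = sum((w[i] - w[j]) ** 2 for i, j in edges)
```
]]></preformat>
<p>In a regularized regression of this kind, a penalty such as the one computed here would be added to the objective together with a lasso term for sparsity; how the two terms are weighted is not specified in this passage and is left out of the sketch.</p>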
<p>Our focus in this study is discovering regulatory elements and deciphering transcriptional regulation in the nematode
<italic>Caenorhabditis elegans</italic>
, a key model organism in developmental biology. In particular, we are interested in using mRNA profiling experiments from developmental time courses, where the high global level of correlation presents a challenge to clustering. Dissection of gene regulatory logic is not as advanced in
<italic>C. elegans</italic>
as it is in
<italic>D. melanogaster</italic>
, for example. There are few motif discovery programs designed specifically for worms, and while worm biologists do use generic programs such as MEME
<xref rid="pcbi.1000761-Bailey1" ref-type="bibr">[6]</xref>
, traditionally they have relied on experimental strategies to define binding motifs and then performed genome-wide motif searches and validation with transgene reporters. One goal of our work is to advance this area of inquiry by defining novel elements and providing new opportunities for directed experimental validation.</p>
<p>As a demonstration of our method, we applied our graph-regularized PLS algorithm to an expression time course for wildtype germline development in
<italic>C. elegans</italic>
<xref rid="pcbi.1000761-Reinke1" ref-type="bibr">[7]</xref>
. We found that the first and second PLS latent factors mapped to expression profiles for oocyte and sperm genes, respectively. In each iteration of our approach, we learn sequence information in the form of a “graph-mer”, i.e. a graph where vertices are
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e005.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, weighted by their contribution to the latent factor, and edges join
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e006.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers that are close in Hamming distance. To parse the motif graphs into component motifs, we applied a graph module discovery algorithm followed by hierarchical agglomeration to produce position specific scoring matrices (PSSMs) from the weighted
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e007.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. Applying this procedure to the significant latent factors generated a collection of known and novel oocyte- and sperm-specific motifs, including novel CG-rich motifs associated with oocyte expression trajectories. One graph-mer derived sperm motif was a bHLH binding site motif and exhibited spatial bias in the promoters of sperm genes but not non-sperm genes. The functional relevance of the CG-rich motifs was supported by strong conservation between
<italic>C. elegans</italic>
and
<italic>C. briggsae</italic>
and by their association with germline-specific
<italic>in situ</italic>
expression patterns. This study provides a proof of principle for using PLS regression to model transcriptional regulation in developmental time series.</p>
</sec>
<sec id="s2">
<title>Results</title>
<sec id="s2a">
<title>Learning graph-mer motifs and corresponding expression trajectories</title>
<p>In order to learn the correspondence between (sets of) regulatory motifs in the promoter sequences of genes and gene expression trajectories over a time course, we posed a regression problem: using a training set of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e008.jpg" mimetype="image"></inline-graphic>
</inline-formula>
genes, learn a linear mapping from the vector of counts of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e009.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer occurrences in a gene's promoter to the gene's time course expression profile. This model can then be used to predict expression from sequence on held-out genes, and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e010.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer features that are highly weighted in the model should represent important regulatory motifs. Here we have a very high-dimensional input space of motifs (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e011.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers) as well as a multivariate output space, both of which rule out use of ordinary least squares regression. Instead, our algorithm makes use of a partial least squares (PLS) regression strategy. PLS is a well-known statistical technique for fitting linear models when the input space is high dimensional
<xref rid="pcbi.1000761-Boulesteix1" ref-type="bibr">[8]</xref>
and has both univariate and multivariate formulations.</p>
<p>Standard PLS represents the input data as a motif matrix
<bold>X</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e012.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e013.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e014.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers), representing
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e015.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer counts for each gene's promoter, and the gene expression data as a matrix
<bold>Y</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e016.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e017.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of experiments), and then it performs two basic steps (see
<xref ref-type="sec" rid="s4">Methods</xref>
for more details):</p>
<list list-type="order">
<list-item>
<p>Construct
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e018.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<italic>weight</italic>
vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e019.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e020.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and corresponding
<italic>latent</italic>
factors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e021.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e022.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where the weight vectors are chosen so that the latent factors have maximal covariance with directions in
<bold>Y</bold>
. The latent factors define a reduced dimensional representation of the promoter sequence data.</p>
</list-item>
<list-item>
<p>Regress
<bold>Y</bold>
against the latent factors using ordinary least squares (or ridge) regression. The latent factor dimensionality reduction followed by linear mapping to
<bold>Y</bold>
yields the final mapping from sequence to expression.</p>
</list-item>
</list>
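<p>The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration of plain (unregularized) PLS, assuming that each weight vector is taken as the leading left singular vector of the cross-covariance matrix and that the input matrix is deflated after each round. The function name pls_sketch and these choices are illustrative assumptions rather than the authors' implementation, and the lasso and graph Laplacian constraints introduced later are omitted.</p>

```python
import numpy as np

def pls_sketch(X, Y, n_factors):
    """Plain PLS sketch (hypothetical helper, not the paper's code).

    X: genes x k-mers count matrix, Y: genes x experiments expression matrix.
    Returns motif weights W, latent factors T, expression directions C,
    and the least-squares coefficients Q mapping T to Y.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Xk = X.copy()
    W, T, C = [], [], []
    for _ in range(n_factors):
        # Step 1: the leading singular vectors of Xk^T Y give the pair (w, c)
        # that maximizes the covariance between Xk w and Y c.
        U, s, Vt = np.linalg.svd(Xk.T @ Y, full_matrices=False)
        w, c = U[:, 0], Vt[0]
        t = Xk @ w  # latent factor for this round
        # Deflate Xk so that the next latent factor is orthogonal to t.
        p = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, p)
        W.append(w)
        T.append(t)
        C.append(c)
    T = np.column_stack(T)
    # Step 2: ordinary least squares regression of Y on the latent factors.
    Q, *_ = np.linalg.lstsq(T, Y, rcond=None)
    return np.column_stack(W), T, np.column_stack(C), Q
```

<p>In this sketch, predictions for the training genes are given by T @ Q (plus the column means of Y), and the number of rounds would be chosen by cross-validation as described in the text.</p>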
<p>PLS algorithms typically work iteratively, so that each round
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e023.jpg" mimetype="image"></inline-graphic>
</inline-formula>
generates a new latent factor, and the number of rounds
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e024.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is chosen by cross-validation to minimize the square loss function in the regression problem.</p>
<p>Here, we are most interested in what PLS tells us about the covariance structure between
<bold>X</bold>
and
<bold>Y</bold>
and how to interpret this information in terms of sequence motifs and expression patterns. In particular, along with
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e025.jpg" mimetype="image"></inline-graphic>
</inline-formula>
weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e026.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in the input motif space, PLS determines corresponding vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e027.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in the output expression space, defined so that
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e028.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is maximal (
<xref ref-type="fig" rid="pcbi-1000761-g001">Figure 1</xref>
). Intuitively, each weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e029.jpg" mimetype="image"></inline-graphic>
</inline-formula>
corresponds to a set of motifs (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e030.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers) that helps explain expression patterns in the direction
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e031.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The components of the vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e032.jpg" mimetype="image"></inline-graphic>
</inline-formula>
that have large positive weights are the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e033.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers that most strongly predict the expression pattern
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e034.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
<fig id="pcbi-1000761-g001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Mapping between motif weight vectors and experiment weight vectors.</title>
<p>At each iteration
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e035.jpg" mimetype="image"></inline-graphic>
</inline-formula>
of the modified PLS algorithm,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e036.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e037.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e038.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are derived by finding latent factors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e039.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e040.jpg" mimetype="image"></inline-graphic>
</inline-formula>
with maximal covariance. For clarity, subscripts
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e041.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are omitted in the diagram and in the rest of the description. Each weight vector
<bold>w</bold>
is a vector in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e042.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e043.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e044.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers used as input to the algorithm. Due to graph-regularization, each weight vector is sparse, i.e. most
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e045.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers have weight
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e046.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, and smooth over a graph connecting sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e047.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, i.e. similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e048.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers get assigned similar weights. Therefore, we can visualize the weight vector as a “graph-mer”, a graph where nodes correspond to
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e049.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers with high positive weights and edges connect sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e050.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers (bottom left). At each iteration, the PLS procedure sets up a correspondence between the motif weight vector
<bold>w</bold>
and a weight vector over expression experiments represented by vector
<bold>c</bold>
. In our setting, the series of expression experiments is a time course, and the vector
<bold>c</bold>
can be viewed as an expression pattern or trajectory (bottom right). Intuitively, we can think of the set of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e051.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers shown in the graph-mer as driving the expression pattern
<bold>c</bold>
. Roughly speaking, the model predicts that genes containing these
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e052.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers will have expression patterns that correlate with
<bold>c</bold>
; more precisely, the full regression model predicts gene expression patterns using all
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e053.jpg" mimetype="image"></inline-graphic>
</inline-formula>
latent factors.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g001"></graphic>
</fig>
<p>To obtain a more interpretable model, we mathematically imposed two additional requirements on the PLS solution. First, we wanted the weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e054.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to be sparse, i.e. we wanted relatively few
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e055.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers to have non-zero components, so that the algorithm produces a small number of hopefully functional motifs. Second, for each weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e056.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we wanted sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e057.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers to have similar weights, since such
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e058.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers may represent variants of the same binding site and potentially should contribute in the same way to the linear model. We achieved the first goal by adding a lasso constraint to the PLS optimization problem (see
<xref ref-type="sec" rid="s4">Methods</xref>
, equation (4)). For the second goal, we defined a graph on the set of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e059.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, joining two
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e060.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers by an edge exactly when they are close in Hamming distance, and imposed a graph Laplacian constraint to obtain smoothness over the graph (see
<xref ref-type="sec" rid="s4">Methods</xref>
, equation (7)). Incorporating these constraints into a multivariate PLS approach yields a new algorithm that we call graph-regularized PLS.</p>
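<p>To make the smoothness constraint concrete, the following sketch builds a Hamming-distance graph over a small set of equal-length k-mers and evaluates the Laplacian quadratic form, which equals the sum of squared weight differences across edges. The distance threshold max_dist=1, the restriction to equal-length k-mers, and the function names are assumptions made for illustration; the exact graph construction and constraint are those given in Methods.</p>

```python
import itertools
import numpy as np

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def kmer_laplacian(kmers, max_dist=1):
    """Graph Laplacian L = D - A for a Hamming-distance k-mer graph.

    max_dist is a hypothetical threshold for 'close in Hamming distance'.
    """
    n = len(kmers)
    A = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        if max_dist >= hamming(kmers[i], kmers[j]):
            A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def smoothness(w, L):
    """w^T L w: sum of (w_i - w_j)^2 over edges; small when neighbors agree."""
    return float(w @ L @ w)
```

<p>A weight vector that assigns equal weights to two neighboring k-mers incurs no penalty on that edge, whereas assigning weight to only one of them does, which is the behavior the graph Laplacian constraint is meant to encourage.</p>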
<p>With these additional constraints, we can view the motif vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e061.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as “graph-mers” – weighted graphs over
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e062.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, where highly weighted dense clusters in the graphs correspond to important sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e063.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer sets, or motifs.
<xref ref-type="fig" rid="pcbi-1000761-g001">Figure 1</xref>
illustrates the mapping between motif weight vectors, interpreted as graph-mers, and corresponding expression patterns, arising from the latent factors found in graph-regularized PLS. Intuitively, we can think of each vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e064.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as the expression pattern driven by the positively weighted
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e065.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e066.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, that is, the common expression trajectory displayed by genes containing these motifs. This correspondence will be important for interpreting regulatory motifs in worm germline development below.</p>
</sec>
<sec id="s2b">
<title>Graph-mer modeling for germline development in worm</title>
<p>We applied our graph-regularized PLS regression algorithm to time series gene expression data for wild-type germline development in the worm
<italic>C. elegans</italic>
<xref rid="pcbi.1000761-Reinke1" ref-type="bibr">[7]</xref>
. This data set consists of a time course beginning in the middle of the third larval stage (L3) and extending through adulthood. During this time, the major developmental changes occur in the germ line. Some germ cells undergo constant proliferation, while others initiate developmental events, including entry into meiosis followed by differentiation into sperm, which occurs in the fourth larval stage, or differentiation into oocytes, which occurs in young adults. By the end of the time course, animals have produced mature gametes and launched embryogenesis. Twelve samples were collected at 3-hour intervals with 3 replicates for each sample. Basic microarray data normalization was performed in the original study, and we used the normalized gene expression levels as reported (Gene Expression Omnibus,
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/geo/">http://www.ncbi.nlm.nih.gov/geo/</ext-link>
, accession numbers GSE726-GSE737). We averaged expression levels over replicates for 20,000 genes and calculated the 5% and 95% quantiles of all expression values. We filtered out genes with baseline expression (defined here as having expression values between the 5% and 95% quantiles at all time points) as well as genes that exhibited little variance in expression over time (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e067.jpg" mimetype="image"></inline-graphic>
</inline-formula>
). After further removing genes without upstream sequences from WormMart, we obtained the gene expression matrix for
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e068.jpg" mimetype="image"></inline-graphic>
</inline-formula>
9,000 genes and 12 time points.</p>
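<p>The filtering procedure described above can be sketched as follows. The variance threshold min_var is a hypothetical parameter (the cutoff actually used is given by the inline formula in the text and is not reproduced here), and the use of the plain per-gene variance as the variability measure, like the function names, is an assumption made for illustration.</p>

```python
import numpy as np

def average_replicates(raw):
    """raw: genes x time points x replicates; returns genes x time points."""
    return raw.mean(axis=2)

def filter_genes(expr, min_var):
    """Boolean mask of genes to keep (hypothetical helper).

    expr: replicate-averaged matrix, genes x time points.
    min_var: assumed variance cutoff; not the value used in the study.
    """
    # Quantiles are computed over all expression values in the matrix.
    lo, hi = np.quantile(expr, [0.05, 0.95])
    # Baseline genes stay between the two quantiles at every time point.
    baseline = np.all(np.logical_and(expr >= lo, hi >= expr), axis=1)
    # Low-variance genes change little over the time course.
    low_var = min_var > expr.var(axis=1)
    return np.logical_not(np.logical_or(baseline, low_var))
```

<p>Applying the returned mask to the rows of the expression matrix yields the filtered gene set; genes lacking upstream sequence would then be removed in a separate step.</p>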
<p>We downloaded promoter sequences spanning 500 bp upstream of transcription start sites from WormMart. For genes whose upstream intergenic sequence is shorter than 500 bp, we used the full intergenic sequence instead. We scanned the promoter sequences for candidate 6-mers and 7-mers, and filtered
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e069.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers based on expected counts in background sequences (see
<xref ref-type="sec" rid="s4">Methods</xref>
).</p>
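<p>As an illustration of the scanning step, the sketch below counts overlapping 6-mers and 7-mers in promoter sequences and assembles a gene-by-k-mer count matrix. It assumes single-strand counting with no reverse-complement merging and omits the background-based filtering described in Methods; the function names are hypothetical.</p>

```python
from collections import Counter

def count_kmers(seq, ks=(6, 7)):
    """Count overlapping k-mers in one promoter, skipping ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for k in ks:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer).issubset("ACGT"):
                counts[kmer] += 1
    return counts

def count_matrix(promoters, vocab):
    """Rows are promoters, columns follow the order of vocab."""
    rows = []
    for seq in promoters:
        counts = count_kmers(seq)
        rows.append([counts[kmer] for kmer in vocab])
    return rows
```

<p>Each row of the resulting matrix corresponds to one row of the motif matrix <bold>X</bold> used as input to the regression.</p>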
</sec>
<sec id="s2c">
<title>Regularized PLS predicts held-out gene expression</title>
<p>We performed 10-fold cross-validation experiments, randomly splitting genes into test and training sets with 10% of the data assigned to test data.
<xref ref-type="fig" rid="pcbi-1000761-g002">Figure 2A</xref>
illustrates the normalized mean squared error (see
<xref ref-type="sec" rid="s4">Methods</xref>
, equation (1)) on the cross-validation test sets versus number of latent factors for both standard and graph-regularized PLS. Here, the mean squared error obtained with zero latent factors (i.e. the variance of the test data) is normalized to 1, so that cross-validation errors below 1 indicate that the model is explaining part of the variance of the held-out data.
<xref ref-type="fig" rid="pcbi-1000761-g002">Figure 2A</xref>
shows the average mean squared error across the cross-validation folds with the standard deviation over folds indicated with error bars. The minimal cross-validation error with standard PLS is obtained with four latent factors. Graph-regularized PLS appears to be more resistant to overfitting, with slightly lower cross-validation error at four latent factors and no substantial increase in error as the number of latent factors increases. Again, cross-validation error suggests that four latent factors should be used in the model. As a negative control, we randomly paired promoter sequences with expression profiles, so that we used real expression data and promoter sequences but lost the correspondence between sequence and expression, and we performed standard PLS and graph-regularized PLS on the randomized data. As can be seen from
<xref ref-type="fig" rid="pcbi-1000761-g002">Figure 2A</xref>
, both standard PLS and graph-regularized PLS on randomized data overfit with the very first latent factor, indicating that the performance obtained on the real data is meaningful.</p>
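<p>The error normalization and the negative control can be illustrated with the short sketch below. It assumes that the zero-factor baseline is the variance of the test data around its per-experiment mean, as the text describes; the precise definition is equation (1) in Methods, which is not reproduced here, and the function names are hypothetical.</p>

```python
import numpy as np

def normalized_mse(Y_true, Y_pred):
    """Test MSE divided by the variance of the test data.

    A value of 1 corresponds to a model with zero latent factors;
    values below 1 mean part of the held-out variance is explained.
    """
    baseline = np.mean((Y_true - Y_true.mean(axis=0)) ** 2)
    return np.mean((Y_true - Y_pred) ** 2) / baseline

def shuffle_pairing(X, Y, seed=0):
    """Negative control: keep real X and Y but break the gene pairing."""
    rng = np.random.default_rng(seed)
    return X, Y[rng.permutation(len(Y))]
```

<p>Training and testing on the output of shuffle_pairing should give normalized errors at or above 1 if the model is not simply fitting noise.</p>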
<fig id="pcbi-1000761-g002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Normalized mean squared error on cross-validation test data.</title>
<p>(A) Normalized mean squared error versus number of latent factors for standard PLS and graph-regularized PLS on real and randomized data. The mean squared error obtained with zero latent factors is normalized to 1. Standard deviations of the squared error across cross-validation sets are plotted as error bars. For the real cross-validation data, standard PLS overfits after the 4th factor; graph-regularized PLS is more resistant to overfitting than standard PLS. As expected, when trained and tested on randomized data, both standard and graph-regularized PLS overfit with the very first factor. (B) Normalized mean squared error of sperm and oocyte gene sets for graph-regularized PLS. The first and second factors dominate oocyte and sperm genes respectively in terms of largest chi-square reduction.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g002"></graphic>
</fig>
</sec>
<sec id="s2d">
<title>Latent factors map to germline-specific expression trajectories</title>
<p>By analyzing separate microarray expression data from germline mutants, the previous study also identified two gene sets consisting of sperm and oocyte genes
<xref rid="pcbi.1000761-Reinke1" ref-type="bibr">[7]</xref>
, which we used in our analysis of the wild-type developmental gene expression profiles. First, we estimated the prediction error on each gene set as shown in
<xref ref-type="fig" rid="pcbi-1000761-g002">Figure 2B</xref>
. Clearly, the first and second latent factors account for the largest loss reduction for oocyte and sperm genes, respectively. To show that the first two factors dominate these two gene sets, we first examined the expression profiles of the two gene sets. In PLS, each weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e070.jpg" mimetype="image"></inline-graphic>
</inline-formula>
gives the weights over time points and can be interpreted as an expression pattern, and genes significantly influenced by the latent factor tend to follow this expression pattern. We plot the oocyte gene expression profiles together with
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e071.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and sperm gene expression profiles with
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e072.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in
<xref ref-type="fig" rid="pcbi-1000761-g003">Figure 3A and 3B</xref>
. The gene expression profiles are strongly correlated with the corresponding weight vectors, indicating that the first two factors are able to retrieve the expression patterns of these two gene sets, respectively. Furthermore, we used functional enrichment analysis to confirm that the genes identified by these two factors, based on correlation with the corresponding weight vector, are indeed enriched for oocyte and sperm genes, respectively (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s001">Figure S1</xref>
(A,B)).</p>
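<p>One simple way to score genes against an expression weight vector is the Pearson correlation between each gene's time course and that vector. The sketch below is an assumed formulation for illustration only; the function name is hypothetical, and genes with constant profiles are assumed to have been removed by the earlier variance filter.</p>

```python
import numpy as np

def profile_correlations(expr, c):
    """Pearson correlation of each gene's profile (rows of expr) with c."""
    E = expr - expr.mean(axis=1, keepdims=True)
    cc = c - c.mean()
    # Assumes no gene has a constant profile (zero norm after centering).
    denom = np.linalg.norm(E, axis=1) * np.linalg.norm(cc)
    return (E @ cc) / denom
```

<p>Genes could then be ranked by this score before testing for enrichment of the oocyte or sperm gene sets.</p>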
<fig id="pcbi-1000761-g003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Correlation of germ cell expression patterns and PLS expression weight vectors.</title>
<p>Oocyte and sperm gene expression patterns are strongly correlated with
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e073.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e074.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, respectively. (A) Oocyte gene expression versus
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e075.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. (B) Sperm gene expression versus
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e076.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g003"></graphic>
</fig>
</sec>
<sec id="s2e">
<title>Interpretation of motif weight vectors</title>
<p>In PLS, each weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e077.jpg" mimetype="image"></inline-graphic>
</inline-formula>
corresponds to a set of motifs (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e078.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers) that help to explain expression patterns in the direction
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e079.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e080.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers with largest coefficients in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e081.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are the most important variables for predicting the projection of the expression patterns of genes onto
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e082.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. To identify motifs relevant for sperm and oocyte gene sets, we selected the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e083.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers ranked by
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e084.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and examined the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e085.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer graphs corresponding to the first two latent factors. Clusters in the graph that are identified by MCODE
<xref rid="pcbi.1000761-Bader1" ref-type="bibr">[9]</xref>
represent motif patterns and hierarchical sequence clustering is performed to generate corresponding PSSMs.
<xref ref-type="fig" rid="pcbi-1000761-g004">Figures 4A</xref>
and
<xref ref-type="fig" rid="pcbi-1000761-g005">5A</xref>
show the graph-mer representation of the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e086.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, motif patterns and PSSMs for the first two factors.</p>
<fig id="pcbi-1000761-g004" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g004</object-id>
<label>Figure 4</label>
<caption>
<title>Sperm motifs determined by graph-mer analysis and positional bias of motif ACGTG.</title>
<p>(A) Sperm motifs extracted from graph-mer output. The graph-mer consisting of the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e087.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers ranked by
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e088.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. Graph motif patterns identified in the form of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e089.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer clusters using the MCODE plug-in
<xref rid="pcbi.1000761-Bader1" ref-type="bibr">[9]</xref>
in Cytoscape are shown in different colors, with each subgraph summarized by a PSSM generated through hierarchical sequence agglomeration of the corresponding
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e090.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. Both the ELT-1 motif GATAA and the bHLH motif ACGTG are found in this way. (B) Distribution of distance of motif ACGTG to TSS (measured in base pairs) in sperm genes versus non-sperm genes. Motif ACGTG occurs more frequently within 200 bp upstream of the TSS in sperm genes relative to non-sperm genes, giving us more confidence in its contribution to sperm gene expression.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g004"></graphic>
</fig>
<fig id="pcbi-1000761-g005" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g005</object-id>
<label>Figure 5</label>
<caption>
<title>Oocyte motifs determined by graph-mer analysis and conservation of graph-mer derived oocyte and sperm motifs.</title>
<p>(A) Top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e091.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers ranked by the weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e092.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, depicted as a graph-mer, which are associated by the PLS procedure to the expression pattern of oocyte genes. Graph motif patterns were identified in the form of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e093.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer clusters using the MCODE plug-in in Cytoscape. PSSMs generated through hierarchical sequence agglomeration of the corresponding
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e094.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer sets are indicated, revealing several CG-rich motifs. (B) Analysis of oocyte
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e095.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer conservation using the motif conservation score (MCS). The plot shows the distribution of (oocyte MCS
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e096.jpg" mimetype="image"></inline-graphic>
</inline-formula>
non-oocyte MCS) for top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e097.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers versus remaining
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e098.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e099.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The score distribution for the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e100.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers has a heavy right tail, showing that as a distribution, the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e101.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers have higher oocyte-specific conservation scores as compared to other
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e102.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e103.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-13 by a one-sided KS statistic). Significantly conserved
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e104.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers are annotated, including CG-rich
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e105.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers for oocyte genes. (C) Distribution of (sperm MCS
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e106.jpg" mimetype="image"></inline-graphic>
</inline-formula>
non-sperm MCS) for top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e107.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers versus remaining
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e108.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e109.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The score distribution for the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e110.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers has a heavy right tail, showing that the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e111.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers have higher sperm-specific conservation scores than other
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e112.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e113.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-5, one-sided KS statistic). Significantly conserved
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e114.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers are annotated, including the ACGTG motif for sperm genes.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g005"></graphic>
</fig>
<p>From the second factor, we successfully found the ELT-1 (‘erythrocyte-like transcription factor’) motif GATAA and bHLH (‘basic helix-loop-helix’) motif ACGTG, as shown in
<xref ref-type="fig" rid="pcbi-1000761-g004">Figure 4A</xref>
. The ELT-1 protein is a transcriptional activator that can recognize the GATA motif, is highly expressed in the germ line, and has as potential targets a number of genes encoding major sperm proteins
<xref rid="pcbi.1000761-Shim1" ref-type="bibr">[10]</xref>
. The bHLH proteins act through E-box elements with consensus CANNTG; the canonical E-box is CACGTG. bHLH proteins have been found to act at the E-box and influence hormone-induced promoter activation in mammalian Sertoli cells, which are required to maintain the process of spermatogenesis
<xref rid="pcbi.1000761-J1" ref-type="bibr">[11]</xref>
; however, this motif has not previously been associated with spermatogenesis in
<italic>C. elegans</italic>
.</p>
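<p>As a minimal illustration of how the learned ACGTG element relates to the E-box, the following sketch scans one strand of a sequence for the degenerate consensus CANNTG and reports which hits contain ACGTG. The helper name <monospace>find_ebox_sites</monospace> and the example sequence are hypothetical and are not taken from the analysis above; the reverse strand is not considered.</p>
<preformat>
```python
import re

# Degenerate E-box consensus CANNTG written as a regular expression.
# The lookahead lets overlapping matches be reported as well.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")

def find_ebox_sites(seq):
    """Return (position, site) pairs for every CANNTG match on the given strand."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(seq.upper())]

# Made-up promoter fragment, for illustration only.
example = "ttgaCACGTGaaccCAGCTGtt"
sites = find_ebox_sites(example)

# Keep only the E-box instances that contain the ACGTG element.
with_acgtg = [site for _, site in sites if "ACGTG" in site]
print(sites)
print(with_acgtg)
```
</preformat>
<p>In this toy sequence both the canonical E-box CACGTG and the variant CAGCTG match the consensus, but only the canonical form contains ACGTG.</p>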
<p>For the first latent factor, the top ranked motifs are CG-rich sequences as shown in
<xref ref-type="fig" rid="pcbi-1000761-g005">Figure 5A</xref>
, which are highly enriched in oocyte gene promoters (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s002">Figure S2</xref>
), suggesting a potential role in oogenesis or regulation of oocyte gene expression. We found further evidence supporting the functional relevance of learned motifs for the first two latent factors by performing gene set enrichment analysis, which showed that oocyte and sperm gene sets are enriched in the corresponding
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e115.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer hits (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s001">Figure S1</xref>
(C,D)).</p>
</sec>
<sec id="s2f">
<title>Positional bias and conservation of motifs</title>
<p>Since functional motifs sometimes exhibit a spatial bias in the promoter region – for example, overrepresentation close to the transcription start site (TSS) – we performed positional analysis of top ranked motifs by examining their distance to the TSS in sperm genes versus non-sperm genes. We observed that the sequence element ACGTG displayed strong positional bias towards the TSS of sperm genes.
<xref ref-type="fig" rid="pcbi-1000761-g004">Figure 4B</xref>
plots the distribution of the distance from ACGTG occurrences to the TSS in sperm genes versus non-sperm genes, showing that ACGTG is found far more frequently within 200 bp upstream of the TSS of sperm genes but displays a fairly uniform distribution relative to the TSS in non-sperm genes. This result indicates that the ACGTG motif is significantly overrepresented immediately upstream of sperm genes, giving us additional confidence in the motif's contribution to sperm gene expression.</p>
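<p>The positional analysis can be sketched as follows. This is an illustrative sketch rather than the procedure used for Figure 4B: it assumes each promoter is given as a string that ends at the TSS, measures distance from the last base of the motif occurrence to the TSS, and scans the forward strand only. The function names and the toy promoters are hypothetical.</p>
<preformat>
```python
def upstream_distances(promoter, motif):
    """Distances (in bp) from each motif occurrence to the TSS.

    Assumes the promoter string ends at the TSS, so an occurrence whose
    last base is adjacent to the TSS has distance 0.
    """
    k = len(motif)
    distances = []
    start = promoter.find(motif)
    while start != -1:
        distances.append(len(promoter) - (start + k))
        start = promoter.find(motif, start + 1)  # allow overlapping hits
    return distances

def fraction_within(promoters, motif, window=200):
    """Fraction of all motif occurrences lying within `window` bp of the TSS."""
    all_d = [d for p in promoters for d in upstream_distances(p, motif)]
    if not all_d:
        return 0.0
    return sum(1 for d in all_d if window >= d) / len(all_d)

# Made-up promoters: the first set places ACGTG close to the TSS.
sperm_like = ["A" * 300 + "ACGTG" + "A" * 50, "A" * 100 + "ACGTG" + "A" * 10]
other = ["ACGTG" + "A" * 400, "A" * 350 + "ACGTG" + "A" * 20]

print(fraction_within(sperm_like, "ACGTG"))
print(fraction_within(other, "ACGTG"))
```
</preformat>
<p>Comparing this fraction between a gene set and its complement gives a simple summary of the bias; a histogram of the raw distances conveys the same information in more detail.</p>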
<p>To look for evidence of the functional roles of CG-rich and other highly weighted motifs, we considered conservation patterns of these sequences.
<italic>Caenorhabditis briggsae</italic>
is closely related to
<italic>C. elegans</italic>
and is frequently used in comparative genomics studies in worm. One expects that motifs responsible for a biological function that is shared by the two species, such as oogenesis, would be under evolutionary pressure and therefore conserved in the promoter regions of orthologous genes contributing to this function. We studied the conservation of all
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e116.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers between the two species and found that highly ranked
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e117.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, where rankings are induced by the 1st and 2nd factor, tended to be more conserved in the oocyte genes and sperm genes, respectively. Specifically, we computed the motif conservation score (MCS)
<xref rid="pcbi.1000761-Waterston1" ref-type="bibr">[12]</xref>
of each
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e118.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer by comparing its conservation rate
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e119.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to its expected rate
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e120.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, estimated using 500 random
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e121.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers of the same length. A conserved occurrence of a
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e122.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer is an instance of the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e123.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer in the
<italic>C. elegans</italic>
genome that is also present in the
<italic>C. briggsae</italic>
ortholog. We reported MCS as a Z-score (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e124.jpg" mimetype="image"></inline-graphic>
</inline-formula>
) measuring the significance of observing
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e125.jpg" mimetype="image"></inline-graphic>
</inline-formula>
conserved occurrences out of total
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e126.jpg" mimetype="image"></inline-graphic>
</inline-formula>
occurrences. To assess the significance of inferred
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e127.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers for oocyte and sperm gene sets, we focused on motif conservation in sperm and oocyte genes relative to non-sperm and non-oocyte genes. To do this, we computed the MCS of each
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e128.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer in both oocyte genes and non-oocyte genes, and we plotted the distribution of the difference of these two MCS scores for top 50 ranked
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e129.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e130.jpg" mimetype="image"></inline-graphic>
</inline-formula>
versus remaining
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e131.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, as shown in
<xref ref-type="fig" rid="pcbi-1000761-g005">Figure 5B</xref>
, bottom panel; similarly,
<xref ref-type="fig" rid="pcbi-1000761-g005">Figure 5C</xref>
shows the difference of the MCS scores for sperm genes and non-sperm genes for the top 50 ranked
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e132.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e133.jpg" mimetype="image"></inline-graphic>
</inline-formula>
versus the remaining
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e134.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. For both oocyte and sperm gene sets, the score distribution for the top 50 k-mers has a heavy right tail relative to other
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e135.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, showing that the top
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e136.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers have higher oocyte- and sperm-specific conservation. To confirm the significance of this observation, we performed a one-sided Kolmogorov-Smirnov (KS) test and found that the rightward shift was highly significant in both cases (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e137.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-13 and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e138.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-5 for oocyte and sperm
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e139.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, respectively). The
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e140.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers that are most significantly conserved in oocyte and sperm genes, relative to non-oocyte and non-sperm genes, are also annotated in
<xref ref-type="fig" rid="pcbi-1000761-g005">Figure 5B and 5C</xref>
; these include the ACGTG motif for sperm genes and CG-rich
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e141.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers for oocyte genes.</p>
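<p>The two statistics used in this section can be sketched in a few lines. The exact MCS formula is not reproduced in the text here, so the sketch assumes the usual binomial form of such a Z-score (observed conserved count compared with its expectation under the background conservation rate); this form is an assumption, as are the function names. The second function computes only the one-sided two-sample KS statistic, not its p-value.</p>
<preformat>
```python
import math

def mcs_z(n_conserved, n_total, p0):
    """Binomial Z-score for n_conserved conserved occurrences out of n_total,
    given an expected conservation rate p0 (assumed form of the MCS)."""
    expected = n_total * p0
    sd = math.sqrt(n_total * p0 * (1.0 - p0))
    return (n_conserved - expected) / sd

def ks_one_sided(top, rest):
    """One-sided two-sample KS statistic: the largest amount by which the
    empirical CDF of `rest` exceeds that of `top`. Large values mean the
    `top` scores are shifted to the right relative to `rest`."""
    def ecdf(sample, x):
        return sum(1 for v in sample if x >= v) / len(sample)
    points = sorted(set(top) | set(rest))
    return max(ecdf(rest, x) - ecdf(top, x) for x in points)

# Made-up numbers, for illustration only.
print(mcs_z(30, 100, 0.2))
print(ks_one_sided([3, 4, 5], [1, 2, 3]))
```
</preformat>
<p>In practice the inputs to the KS statistic would be the per-k-mer differences between the MCS in the gene set of interest and the MCS in its complement, for the top-ranked k-mers and for the remaining k-mers.</p>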
</sec>
<sec id="s2g">
<title>Targets of CG-rich motifs are expressed in the germline</title>
<p>Relatively little is known about transcriptional regulation of oocyte genes. To gain additional evidence supporting a functional role for learned motifs, we examined the
<italic>in situ</italic>
expression patterns of genes enriched with those motifs. We searched for a subset of EST (expressed sequence tag) clones known as YK clones of each gene in WormBase (
<ext-link ext-link-type="uri" xlink:href="http://www.wormbase.org">http://www.wormbase.org</ext-link>
) and looked at
<italic>in situ</italic>
expression patterns at the L4-adult stage associated with each YK clone in the Nematode Expression Pattern Database (NEXTDB
<ext-link ext-link-type="uri" xlink:href="http://nematode.lab.nig.ac.jp/db2/index.php">http://nematode.lab.nig.ac.jp/db2/index.php</ext-link>
).</p>
<p>The
<italic>in situ</italic>
analysis provides direct evidence about where the genes are expressed, and we expect that genes highly ranked by motif hits are more likely to be germline expressed. To obtain a ranked gene list for each of the three motifs in
<xref ref-type="fig" rid="pcbi-1000761-g005">Figure 5A</xref>
, we first defined the gene group associated with the first factor based on
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e142.jpg" mimetype="image"></inline-graphic>
</inline-formula>
values (see
<xref ref-type="sec" rid="s4">Methods</xref>
). For each motif, we ranked genes within the gene group by counts of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e143.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers of that motif and came up with a list consisting of top
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e144.jpg" mimetype="image"></inline-graphic>
</inline-formula>
80 genes.
<xref ref-type="table" rid="pcbi-1000761-t001">Table 1</xref>
summarizes the
<italic>in situ</italic>
expression patterns of genes associated with motif 1 (GGCGC), motif 2 (GCGCG) and motif 3 (ACCGTA). We split each gene list into two groups, those already known to be oocyte genes, and genes with high motif scores not already defined as oocyte genes. For each group,
<xref ref-type="table" rid="pcbi-1000761-t001">Table 1</xref>
shows the number of genes examined; the number of genes with an
<italic>in situ</italic>
pattern; and percentage of genes expressed in germline tissues only, in both germline and somatic tissues, and somatic tissues only.</p>
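<p>The ranking step can be sketched as below. This is a simplified illustration under stated assumptions: a motif is represented as a plain list of k-mers, occurrences are counted on the forward strand only, and ties are broken alphabetically. The function names, gene names and sequences are hypothetical.</p>
<preformat>
```python
def count_hits(promoter, kmers):
    """Total number of (possibly overlapping) occurrences of any k-mer in `kmers`."""
    total = 0
    for kmer in kmers:
        start = promoter.find(kmer)
        while start != -1:
            total += 1
            start = promoter.find(kmer, start + 1)
    return total

def rank_genes(promoters, kmers, top_n=80):
    """Rank gene names by descending hit count; ties are broken alphabetically."""
    scored = [(count_hits(seq, kmers), name) for name, seq in promoters.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]

# Made-up promoters for three hypothetical genes.
toy_promoters = {
    "geneA": "TTGGCGCAAGGCGCTT",
    "geneB": "TTTTGGCGCTTTT",
    "geneC": "ATATATAT",
}
print(rank_genes(toy_promoters, ["GGCGC"], top_n=2))
```
</preformat>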
<table-wrap id="pcbi-1000761-t001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.t001</object-id>
<label>Table 1</label>
<caption>
<title>
<italic>In situ</italic>
analysis of genes enriched with CG-rich motifs.</title>
</caption>
<alternatives>
<graphic id="pcbi-1000761-t001-1" xlink:href="pcbi.1000761.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Motif</td>
<td align="left" rowspan="1" colspan="1">Previously identified as oocyte genes</td>
<td align="left" rowspan="1" colspan="1"># genes</td>
<td align="left" rowspan="1" colspan="1"># genes with
<italic>in situ</italic>
pattern</td>
<td align="left" rowspan="1" colspan="1">% Germline only</td>
<td align="left" rowspan="1" colspan="1">% Germline &amp; somatic</td>
<td align="left" rowspan="1" colspan="1">% Somatic only</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Motif 1</td>
<td align="left" rowspan="1" colspan="1">yes</td>
<td align="left" rowspan="1" colspan="1">29</td>
<td align="left" rowspan="1" colspan="1">28</td>
<td align="left" rowspan="1" colspan="1">71%</td>
<td align="left" rowspan="1" colspan="1">7%</td>
<td align="left" rowspan="1" colspan="1">5%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">(GGCGC)</td>
<td align="left" rowspan="1" colspan="1">no</td>
<td align="left" rowspan="1" colspan="1">52</td>
<td align="left" rowspan="1" colspan="1">37</td>
<td align="left" rowspan="1" colspan="1">73%</td>
<td align="left" rowspan="1" colspan="1">8%</td>
<td align="left" rowspan="1" colspan="1">13%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Motif 2</td>
<td align="left" rowspan="1" colspan="1">yes</td>
<td align="left" rowspan="1" colspan="1">31</td>
<td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">80%</td>
<td align="left" rowspan="1" colspan="1">4%</td>
<td align="left" rowspan="1" colspan="1">4%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">(GCGCG)</td>
<td align="left" rowspan="1" colspan="1">no</td>
<td align="left" rowspan="1" colspan="1">55</td>
<td align="left" rowspan="1" colspan="1">43</td>
<td align="left" rowspan="1" colspan="1">74%</td>
<td align="left" rowspan="1" colspan="1">14%</td>
<td align="left" rowspan="1" colspan="1">5%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Motif 3</td>
<td align="left" rowspan="1" colspan="1">yes</td>
<td align="left" rowspan="1" colspan="1">26</td>
<td align="left" rowspan="1" colspan="1">16</td>
<td align="left" rowspan="1" colspan="1">94%</td>
<td align="left" rowspan="1" colspan="1">0%</td>
<td align="left" rowspan="1" colspan="1">0%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">(ACCGTA)</td>
<td align="left" rowspan="1" colspan="1">no</td>
<td align="left" rowspan="1" colspan="1">62</td>
<td align="left" rowspan="1" colspan="1">38</td>
<td align="left" rowspan="1" colspan="1">76%</td>
<td align="left" rowspan="1" colspan="1">10%</td>
<td align="left" rowspan="1" colspan="1">0%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label></label>
<p>For each graph-mer derived motif, we identified the set of genes associated to the motif based on latent factor analysis (see
<xref ref-type="sec" rid="s4">Methods</xref>
). Each gene list was further split into two sets: genes that had been previously identified as oocyte genes based on mutant expression data and those not identified as oocyte genes by this previous analysis. The table shows the number of genes associated to the motif; the number of genes having an
<italic>in situ</italic>
pattern in the NEXTDB database; and genes expressed in germline tissues only, in both germline and somatic tissues, and somatic tissues only as a percentage of genes with an
<italic>in situ</italic>
pattern. The results show that even among genes not previously identified as oocyte genes, more than 70% of genes examined were dominantly expressed in germline tissues rather than somatic tissues. This percentage is much higher than seen overall for genes that were not previously called oocyte or sperm without considering motif information (20%), suggesting a functional role of CG-rich motifs in germline expression.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Over all three motifs, 73% of the genes have detectable
<italic>in situ</italic>
staining. Of those, an average of 78% stain only in the germ line, with more than 80% of genes previously identified as oocyte genes staining in the germ line.</p>
<p>More than 70% of genes that had not previously been identified as oocyte genes (based on mutant expression profiling) were also dominantly expressed in germline tissues rather than somatic tissues. In the study that defined the oocyte and sperm gene sets
<xref rid="pcbi.1000761-Reinke1" ref-type="bibr">[7]</xref>
, about 20% of genes that were not identified as oocyte or sperm genes showed germline expression by
<italic>in situ</italic>
analysis.
<xref ref-type="table" rid="pcbi-1000761-t001">Table 1</xref>
shows that for the genes that were associated with oocyte motifs 1, 2 and 3 via latent factor analysis – but had not previously been identified as oocyte genes – 37/52, 43/55, and 38/62 showed germline expression. All these proportions are very significantly higher than the background percentage of 20% (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e145.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-16 for all motifs by a proportions test). These results provide additional evidence that we are learning functional motifs that contribute to germline expression.</p>
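<p>A comparison of an observed proportion against a fixed background rate can be sketched with a one-sided normal-approximation test. This is an assumption made for illustration: the exact form of the proportions test used above is not specified, so the values printed here need not match the reported ones. The function name is hypothetical; the counts are those quoted in the paragraph above and the background rate is 20%.</p>
<preformat>
```python
import math

def one_sided_proportion_test(successes, n, p0):
    """Normal-approximation test that the true proportion exceeds p0.

    Returns the pair (z, p_value), where p_value is the upper-tail
    probability of a standard normal at z.
    """
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value

# Counts quoted in the text for motifs 1, 2 and 3.
for k, n in [(37, 52), (43, 55), (38, 62)]:
    z, p = one_sided_proportion_test(k, n, 0.20)
    print(k, n, round(z, 2), p)
```
</preformat>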
</sec>
<sec id="s2h">
<title>Comparison with principal component analysis</title>
<p>Principal component analysis (PCA) is a widely used dimensionality reduction technique that extracts from the data matrix a sequence of orthogonal vectors, or principal components, that capture the directions of maximal variance in the input data. PCA is frequently used on either rows (genes) or columns (experiments) of a gene expression matrix for visualization or preprocessing prior to other kinds of analysis
<xref rid="pcbi.1000761-Raychaudhuri1" ref-type="bibr">[13]</xref>
. By contrast, PLS is a supervised method that, in our context, determines weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e146.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as directions in gene expression space having maximal covariance with latent factors in motif space. Both PCA components and PLS weight vectors are interpreted as gene expression patterns. However, principal components are learned from gene expression data only, while weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e147.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are found based on a linear mapping from motif space to gene expression space.</p>
<p>We were interested in comparing our (graph-regularized) PLS results with standard PCA in order to assess the value added by the motif information and supervised learning formulation. We anticipated some concordance of results, since directions that capture little variance in the expression data will also fail to have significant covariance with motif latent factors.
<xref ref-type="fig" rid="pcbi-1000761-g006">Figure 6A and 6B</xref>
plot the first four PCA components versus PLS weight vectors. The first and second PCA components indeed bear some similarity to the first and second PLS weight vectors and to some extent resemble the oocyte and sperm gene expression patterns, respectively. Since these two gene sets are fairly large and follow distinct expression patterns, they account for a significant portion of gene expression variance, and so it is not surprising that the first PCs show correlation with these patterns. However, all the principal components are less smooth, as expression trajectories, than their corresponding PLS weight vectors, and the smoothness of the PCs deteriorates more rapidly than in PLS as the number of principal components/latent factors increases. It therefore appears that PLS uses motif information to provide some degree of regularization on the weight vectors, leading to smoother expression patterns corresponding to latent factors.</p>
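<p>The difference between the two kinds of direction can be illustrated on a toy example. The sketch below is not our graph-regularized algorithm: it uses the standard fact that the first PLS weight vector in the output space is the dominant right singular vector of the cross-covariance matrix between inputs and outputs, whereas the first principal component is the dominant eigenvector of the output covariance alone. The regularization is omitted, and the matrices and function names are made up for illustration.</p>
<preformat>
```python
import math

def center(M):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in M]

def matmul_t(A, B):
    """Return A^T B for row-major matrices with the same number of rows."""
    return [[sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(len(B[0]))]
            for i in range(len(A[0]))]

def top_eigvec(S, iters=500):
    """Dominant eigenvector of a symmetric PSD matrix by power iteration."""
    v = [1.0 + 0.1 * i for i in range(len(S))]
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(S))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def first_pc(Y):
    """First principal component of the outputs (uses Y only)."""
    Yc = center(Y)
    return top_eigvec(matmul_t(Yc, Yc))

def first_pls_weight(X, Y):
    """First PLS output-space weight vector (unregularized)."""
    C = matmul_t(center(X), center(Y))   # inputs x outputs cross-covariance
    return top_eigvec(matmul_t(C, C))    # dominant right singular vector of C

# Toy data: 4 genes, 1 motif count, 2 time points. The second time point
# has large variance that is unrelated to the motif count.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0, 5.0], [1.0, -5.0], [2.0, -5.0], [3.0, 5.0]]
print(first_pc(Y))
print(first_pls_weight(X, Y))
```
</preformat>
<p>In this toy case the first principal component points along the high-variance second time point, while the PLS weight vector points along the first time point, the only direction that covaries with the motif count.</p>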
<fig id="pcbi-1000761-g006" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Comparison of PCA components and PLS expression weight vectors in gene expression space.</title>
<p>The first and second principal components bear some similarity to corresponding PLS weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e148.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e149.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, but all principal components are less smooth than in PLS. (A) PCA identifies the first four directions (PC
<sub>1</sub>
, PC
<sub>2</sub>
, PC
<sub>3</sub>
and PC
<sub>4</sub>
) that have maximal variance in gene expression space. Principal components are plotted vs. time. (B) Graph-regularized PLS learns weight vectors (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e150.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e151.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e152.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e153.jpg" mimetype="image"></inline-graphic>
</inline-formula>
) based on a linear mapping from motif space to gene expression space. Weight vectors are plotted vs. time.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g006"></graphic>
</fig>
<p>To confirm that the PLS-derived motifs could not be determined from analysis of the first and second principal components (PC
<sub>1</sub>
and PC
<sub>2</sub>
), we performed the following motif discovery procedure: we identified the sets of genes that are highly correlated with PC
<sub>1</sub>
and PC
<sub>2</sub>
, and ran the AlignACE motif discovery program on the promoters of these genes, yielding 58 and 89 motifs, respectively (see
<xref ref-type="supplementary-material" rid="pcbi.1000761.s007">Text S1</xref>
). In both cases, the top-ranked motifs were dominated by AA-rich and GG-rich motifs that likely come from low complexity regions (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s005">Figure S5</xref>
). A few CG-rich motifs appear in the AlignACE list for PC
<sub>1</sub>
, but with relatively low MAP scores; only one motif from the list for PC
<sub>2</sub>
matches any of the PLS-derived sperm motifs, and it occurs low in the ranking (rank = 33) with a relatively weak MAP score. We conclude that analysis of the principal components does not retrieve the full motif information discovered by the PLS latent factors. This result underscores the importance of our predictive framework, mapping sequence to expression, rather than relying on correlation with expression and performing motif analysis after the fact.</p>
<p>Since the third and fourth PLS latent factors represent much smoother and quite different expression patterns than their PCA counterparts, we examined whether the genes associated with these factors based on motif and expression similarity (see
<xref ref-type="sec" rid="s4">Methods</xref>
) may have common functions. While the few genes associated with the fourth PLS factor (18 genes) showed no enrichment for GO terms, the gene set for the third PLS factor was significantly enriched for 54 GO terms (using a threshold of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e154.jpg" mimetype="image"></inline-graphic>
</inline-formula>
e-4, uncorrected hypergeometric p-value), of which the majority involved metabolism (32/54) and almost half of these were specific to amino acid metabolism (15/54). These genes are not enriched for germline expression, suggesting that our analysis has uncovered an independent co-regulation of a set of gene functions that might have been swamped out by the stronger germline information using other techniques.</p>
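<p>An uncorrected hypergeometric enrichment p-value of the kind used for the GO analysis can be computed as an upper-tail probability. The sketch below is a generic illustration with a hypothetical function name and made-up counts; it does not reproduce the gene universe or annotation sets used above.</p>
<preformat>
```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which carry the annotation of interest."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Made-up example: universe of 10 genes, 4 annotated, 3 drawn, 2 annotated hits.
print(hypergeom_upper_tail(2, 4, 3, 10))
```
</preformat>
<p>Because one such test is run per GO term, the resulting p-values are uncorrected for multiple testing unless a correction is applied afterwards.</p>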
</sec>
<sec id="s2i">
<title>Comparison with clustering</title>
<p>Finally, we compared our results with standard cluster-first analysis, using hierarchical clustering to identify 5 distinct gene clusters and applying the AlignACE motif discovery program to the promoters of each cluster in order to find over-represented motifs (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s007">Text S1</xref>
). We identified two clusters (Clusters 1, 2) with subtly different expression patterns both resembling the expression signature of oocyte genes and one cluster (Cluster 3) similar to the sperm gene expression signature (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s006">Figure S6</xref>
(A,B,C)). AlignACE returned lists of 47, 53 and 36 motifs for these three clusters, and as in the principal component analysis, the top ranked motifs in all cases were dominated by low-complexity AA-rich and GG-rich motifs (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s006">Figure S6</xref>
(D,E,F)). A handful of low-ranked motifs with relatively poor MAP scores for Clusters 1 and 2 resembled two of the CG-rich
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e155.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers identified through the first PLS latent factor; for Cluster 3, none of the AlignACE motifs were similar to the sperm-specific
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e156.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers identified by the second PLS latent factor (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s007">Text S1</xref>
). We conclude that PLS avoids many presumably spurious motifs from low complexity regions while finding true germline-specific motifs that are missed through standard cluster-based analysis.</p>
</sec>
</sec>
<sec id="s3">
<title>Discussion</title>
<p>Relatively few methods integrate mRNA expression and promoter sequence data beyond “cluster-first” motif discovery. Beer and Tavazoie
<xref rid="pcbi.1000761-Beer1" ref-type="bibr">[14]</xref>
similarly sought to reverse the information flow implied by clustering, to see how well motif content could predict expression patterns; in their case, however, expression patterns were identified with static clusters, motifs were discovered based on these clusters, and the learning task was the prediction of cluster membership rather than vector-valued expression profile. Ernst et al.
<xref rid="pcbi.1000761-Ernst1" ref-type="bibr">[15]</xref>
proposed a time-ordered hierarchical model for integrating motif and time series expression data, where motifs were associated with up/down bifurcations of expression profiles at particular time points; this method used static motif data rather than learning motifs. Segal et al.
<xref rid="pcbi.1000761-Segal1" ref-type="bibr">[16]</xref>
combined promoter sequence and expression data within a probabilistic relational models framework to learn “modules” supported by both data sources; rather than learning motifs de novo, the algorithm was seeded with database motifs which could then be refined during expectation-maximization iterations. In our own previous work on the MEDUSA algorithm
<xref rid="pcbi.1000761-Middendorf1" ref-type="bibr">[17]</xref>
, we discretized expression data and used a boosting-based algorithm to discover motifs and assemble a regulatory program that predicts up/down expression of target genes. MEDUSA is well-suited to perturbation experiments and performs well even for small perturbation data sets
<xref rid="pcbi.1000761-Kundaje1" ref-type="bibr">[18]</xref>
. In the current setting, where expression levels in consecutive time points are highly correlated and expression trajectories are smooth over time, discretizing the expression levels incurs a significant loss of signal, which we avoid by moving to a regression framework.</p>
<p>There have been several other regression based motif discovery approaches related to our work. For example, REDUCE
<xref rid="pcbi.1000761-Bussemaker1" ref-type="bibr">[19]</xref>
was the original method to use correlation between
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e157.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers and differential expression for motif discovery. REDUCE, however, uses each experiment independently, whereas we use multivariate PLS to treat full expression trajectories as the output space. To weigh the benefits of regression with a multivariate output, we also tried fitting a separate graph-regularized univariate PLS model on each time point. We found that multivariate PLS outperforms univariate PLS (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s003">Figure S3</xref>
), suggesting that correlating motifs with full expression patterns is more statistically accurate than performing regression one experiment at a time, at least in the case of correlated experiments such as time series data. Moreover, there was substantial overlap in the motif information inferred from nearby time points (see
<xref ref-type="supplementary-material" rid="pcbi.1000761.s007">Text S1</xref>
), showing that fitting a separate model for each time point entails a good deal of redundancy.</p>
<p>More recently, Zhang et al.
<xref rid="pcbi.1000761-Zhang1" ref-type="bibr">[20]</xref>
used PCA to define a basis of univariate response variables in the output space and then performed a REDUCE-like regression onto each variable to collect a set of motifs. In our work, by doing multivariate regression, we retain more structure in the solution, for example, a stratification of the output space by images of latent factors, each one corresponding to a characteristic time expression profile. We also note that lasso regression has been used elsewhere for learning regulatory networks in bacteria using time course expression data
<xref rid="pcbi.1000761-Bonneau1" ref-type="bibr">[21]</xref>
, and standard PLS has been used with a collection of known motifs in linear modeling of expression data in yeast and bacteria
<xref rid="pcbi.1000761-Brilli1" ref-type="bibr">[22]</xref>
. Finally, graph-based motif representations have been used previously by other groups, for example Naughton et al.
<xref rid="pcbi.1000761-Naughton1" ref-type="bibr">[23]</xref>
, but this work again falls into the “cluster-first” category in that it seeks to find overrepresented motifs for a predefined gene set. By contrast, we learn motifs via a global regression problem, and the graph structure is encoded as a constraint on the solution.</p>
<p>A number of recent studies have expanded beyond the linear regression framework by introducing various kinds of non-linearity. First, various authors have extended standard linear models by proposing that certain sets of motifs have synergistic effects. For example, a synergistic pair of TFs can be modeled by including a term in the regression model for each of the individual motif counts as well as a third term for the product of the counts, as recently reviewed
<xref rid="pcbi.1000761-Das1" ref-type="bibr">[24]</xref>
. However, introducing too many of these additional non-linear terms greatly increases the risk of overfitting; for a typical pair of TFs, the count of co-occurrences is simply too sparse to estimate the synergistic parameter. These models require careful feature selection strategies; moreover, they mostly assume that the motifs are known and fixed, whereas we are performing
<italic>de novo</italic>
motif discovery. Second, motivated by biochemical models, several studies propose that the relationship between motif counts or TF occupancy scores (in the case of PSSMs) and log expression change is not linear and make use of a non-linear transfer function. Recent work using a probabilistic framework to predict the 1D anterior-posterior positioning of expression “stripes” in the early Drosophila embryo from
<italic>cis</italic>
regulatory module (CRM) sequences can be seen as an elegant example of this idea
<xref rid="pcbi.1000761-Segal2" ref-type="bibr">[25]</xref>
. In this case, a logistic transfer function converts occupancy scores, computed from the space of configurations of TFs in the CRM, into sharp stripe boundaries. In our setting, however, we are learning from microarray expression data, which gives average (and noisy) measurements over a large population of cells with large underlying variation of expression levels. It is unclear whether mRNA expression data allows us to observe and model biochemically-expected non-linearity in this situation. Third, when confronted with a multivariate response, such as in time series expression profiles, some authors have used a model where each motif count/occupancy score contributes linearly to the expression pattern at each time point (as we do) but the time points are connected by use of non-linear basis functions such as splines
<xref rid="pcbi.1000761-Wang1" ref-type="bibr">[26]</xref>
. However, we find that the smoothness of the PLS-derived expression patterns comes for free as a result of the regularization choices in our method, so in our hands the smoothness prior did not seem to be statistically necessary.</p>
<p>Finally, our method can be applied to even more sparsely sampled time series covering a broader range of developmental stages. As a proof-of-principle, we applied graph-regularized PLS to a full life cycle
<italic>C. elegans</italic>
developmental time course consisting of whole-animal gene expression profiles from egg to adult
<xref rid="pcbi.1000761-Hill1" ref-type="bibr">[27]</xref>
(see
<xref ref-type="supplementary-material" rid="pcbi.1000761.s007">Text S1</xref>
). In this setting, the first latent factor contained germline-specific motifs similar to the ones found in the analysis of our main data set, while the second and third latent factors were associated with more diverse biological functions (
<xref ref-type="supplementary-material" rid="pcbi.1000761.s004">Figure S4</xref>
). These results suggest that our approach can discover the structure of gene regulatory programs, in the form of latent factors corresponding to sequence patterns and expression trajectories, at a range of developmental time scales.</p>
</sec>
<sec sec-type="materials|methods" id="s4">
<title>Materials and Methods</title>
<sec id="s4a">
<title>Standard partial least squares regression</title>
<p>Since our algorithm builds on ideas from PLS regression, we first describe how to use standard PLS to iteratively learn a linear mapping from the promoter sequences of genes, as represented by their
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e158.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer counts, to their mRNA expression profiles. Formally, using a training set of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e159.jpg" mimetype="image"></inline-graphic>
</inline-formula>
genes, PLS takes a motif matrix
<bold>X</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e160.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e161.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e162.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers), representing the individual
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e163.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer counts for each gene, and a gene expression matrix
<bold>Y</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e164.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e165.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of experiments). Here, the columns of
<bold>X</bold>
represent the independent variables (features) and the columns of
<bold>Y</bold>
are the response variables; we also call
<bold>X</bold>
the input matrix and
<bold>Y</bold>
the output matrix. PLS then performs the following steps:</p>
<list list-type="alpha-lower">
<list-item>
<p>Scale
<bold>X</bold>
and
<bold>Y</bold>
so that each column of the input and output matrices has zero mean and unit variance.</p>
</list-item>
<list-item>
<p>Perform dimensionality reduction by construction of latent factors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e166.jpg" mimetype="image"></inline-graphic>
</inline-formula>
: Construct
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e167.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<italic>weight</italic>
vectors, placed as column vectors in
<bold>W</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e168.jpg" mimetype="image"></inline-graphic>
</inline-formula>
), and corresponding
<italic>latent</italic>
factors, placed as column vectors in
<bold>T</bold>
(dimension
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e169.jpg" mimetype="image"></inline-graphic>
</inline-formula>
), where the weight vectors are chosen so that the latent factors have maximal covariance with directions in the multivariate response
<bold>Y</bold>
.</p>
</list-item>
<list-item>
<p>Use the latent factors
<bold>T</bold>
to predict
<bold>Y</bold>
: Regress
<bold>Y</bold>
against the latent factors using ordinary least squares (or ridge) regression,
<disp-formula>
<graphic xlink:href="pcbi.1000761.e170.jpg" mimetype="image" position="float"></graphic>
</disp-formula>
</p>
</list-item>
<list-item>
<p>Obtain the matrix
<bold>B</bold>
of regression coefficients:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e171.jpg" mimetype="image" position="float"></graphic>
</disp-formula>
</p>
</list-item>
</list>
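<p>As a rough illustration of steps (a) through (d), the following Python sketch fits a PLS model with a single latent factor using only the standard library. It assumes that the first weight vector is the dominant left singular vector of the cross-product of the scaled input and output matrices, found here by power iteration. The function names (standardize, first_pls_factor) and the toy matrices are hypothetical and are not taken from the implementation used in this study.</p>
<preformat>
```python
import math

def standardize(M):
    """Step (a): scale each column to zero mean and unit variance."""
    n = len(M)
    scaled_cols = []
    for col in zip(*M):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        sd = math.sqrt(var) or 1.0  # guard against constant columns
        scaled_cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def first_pls_factor(X, Y, iters=200):
    """Return (w, t, B) for a one-factor PLS fit on scaled X and Y."""
    Xt, Yt = transpose(X), transpose(Y)
    p = len(Xt)
    w = [1.0 / math.sqrt(p)] * p
    # Power iteration on S S^T with S = X^T Y, without forming S explicitly.
    for _ in range(iters):
        s_w = matvec(Xt, matvec(Y, matvec(Yt, matvec(X, w))))
        norm = math.sqrt(sum(v * v for v in s_w))
        if norm == 0:
            break
        w = [v / norm for v in s_w]
    t = matvec(X, w)                                    # step (b): latent factor
    tt = sum(v * v for v in t)
    q = [sum(a * b for a, b in zip(col, t)) / tt for col in Yt]  # step (c): OLS of Y on t
    B = [[wi * qj for qj in q] for wi in w]             # step (d): B = w q^T
    return w, t, B

# Toy data (hypothetical): 6 genes, 3 k-mer counts, 2 experiments.
X = standardize([[1, 0, 2], [2, 1, 0], [3, 0, 1], [4, 1, 2], [5, 0, 0], [6, 1, 1]])
Y = standardize([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10], [6, 12]])
w, t, B = first_pls_factor(X, Y)
print([round(v, 3) for v in w])
```
</preformat>
<p>In this toy example the expression columns depend only on the first feature, so the first coordinate of the weight vector should dominate. A full PLS fit would repeat the construction for further factors after deflation.</p>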
<p>We split genes into training and test sets for cross-validation experiments. The training data, comprising the motif matrix
<bold>X</bold>
and the gene expression matrix
<bold>Y</bold>
, were used to learn the matrix of regression coefficients
<bold>B</bold>
. We then assessed the predictive power of PLS on the test data
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e172.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e173.jpg" mimetype="image"></inline-graphic>
</inline-formula>
by normalized mean squared error (NMSE):
<disp-formula>
<graphic xlink:href="pcbi.1000761.e174.jpg" mimetype="image" position="float"></graphic>
<label>(1)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e175.jpg" mimetype="image"></inline-graphic>
</inline-formula>
denotes the expected value and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e176.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
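<p>For readers who want a concrete reference point, the sketch below computes one common form of normalized mean squared error, in which the mean squared prediction error is divided by the variance of the observed values. This normalization is an assumption of the sketch; Equation (1) gives the definition actually used. The function name nmse is hypothetical.</p>
<preformat>
```python
def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the observed values.

    Assumption: normalization by the variance of y_true, so that a
    predictor returning the mean of y_true scores exactly 1.
    """
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    var = sum((a - mean) ** 2 for a in y_true) / n
    return mse / var

print(nmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction
print(nmse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean
```
</preformat>
<p>Under this convention, values below 1 indicate that the model predicts held-out expression better than the mean expression level does.</p>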
<p>PLS not only provides a solution to the regression problem, but it also describes the covariance structure between
<bold>X</bold>
and
<bold>Y</bold>
. It constructs
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e177.jpg" mimetype="image"></inline-graphic>
</inline-formula>
weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e178.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in the input space
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e179.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and corresponding vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e180.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in the output space
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e181.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e182.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is maximal. Intuitively, each weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e183.jpg" mimetype="image"></inline-graphic>
</inline-formula>
corresponds to a set of motifs (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e184.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers) that helps explain expression patterns in the direction
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e185.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e186.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers with largest coefficients in
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e187.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are the most important variables for predicting the projection of the expression patterns of genes onto
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e188.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</sec>
<sec id="s4b">
<title>SIMPLS algorithm</title>
<p>There are a number of variants of PLS, each of which defines and solves an optimization problem for constructing the weight matrix
<bold>W</bold>
. We use the SIMPLS (Statistically Inspired Modification of PLS) algorithm
<xref rid="pcbi.1000761-Jong1" ref-type="bibr">[28]</xref>
, which optimizes an objective function defined on the matrix
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e189.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The latent factors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e190.jpg" mimetype="image"></inline-graphic>
</inline-formula>
in
<bold>T</bold>
are built sequentially by estimating weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e191.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as follows:</p>
<p>For
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e192.jpg" mimetype="image"></inline-graphic>
</inline-formula>
:</p>
<list list-type="alpha-lower">
<list-item>
<p>Maximize the covariance between
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e193.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<bold>Y</bold>
:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e194.jpg" mimetype="image" position="float"></graphic>
<label>(2)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e195.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is a unit vector.</p>
</list-item>
<list-item>
<p>Impose orthogonality constraints
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e196.jpg" mimetype="image"></inline-graphic>
</inline-formula>
for all
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e197.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, by deflating
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e198.jpg" mimetype="image"></inline-graphic>
</inline-formula>
:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e199.jpg" mimetype="image" position="float"></graphic>
<label>(3)</label>
</disp-formula>
where (i) If
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e200.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e201.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</list-item>
</list>
<p>(ii) If
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e202.jpg" mimetype="image"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e203.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e204.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
</sec>
<sec id="s4c">
<title>Regularized partial least squares regression</title>
<p>We now modify the PLS algorithm with the dual goals of (1) making the solution more interpretable and (2) regularizing the optimization problem, to reduce overfitting. We impose two constraints to achieve these goals. First, we use a lasso (
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e205.jpg" mimetype="image"></inline-graphic>
</inline-formula>
) constraint
<xref rid="pcbi.1000761-Tibshirani1" ref-type="bibr">[2]</xref>
to promote sparsity in the weight vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e206.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, that is, drive the weights for many
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e207.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers to zero. Sparsity is clearly attractive since fewer
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e208.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers contribute to the solution, making it easier to identify the most important motifs. The lasso constraint over coordinates
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e209.jpg" mimetype="image"></inline-graphic>
</inline-formula>
of weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e210.jpg" mimetype="image"></inline-graphic>
</inline-formula>
takes the form:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e211.jpg" mimetype="image" position="float"></graphic>
<label>(4)</label>
</disp-formula>
</p>
<p>For the second constraint, we want sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e212.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers to have similar coefficients in the weight vectors, so that a group of similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e213.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers are more likely to act as a single motif pattern in the regression problem. We define a graph structure on the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e214.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers where we place an edge
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e215.jpg" mimetype="image"></inline-graphic>
</inline-formula>
if the Hamming distance between the pair of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e216.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e217.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e218.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is less than threshold
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e219.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. Since
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e220.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers represent potential binding sites in double-stranded DNA, here we take the distance between two
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e221.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e222.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e223.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to be the minimum of the Hamming distances
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e224.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e225.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e226.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the reverse complement of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e227.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. We then impose a smoothness constraint in the form of the graph Laplacian
<xref rid="pcbi.1000761-Weinberger1" ref-type="bibr">[29]</xref>
, as described below. The Laplacian matrix
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e228.jpg" mimetype="image"></inline-graphic>
</inline-formula>
for an unweighted graph is defined as
<disp-formula>
<graphic xlink:href="pcbi.1000761.e229.jpg" mimetype="image" position="float"></graphic>
<label>(5)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e230.jpg" mimetype="image"></inline-graphic>
</inline-formula>
denotes the degree of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e231.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e232.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, the number of edges that connect
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e233.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e234.jpg" mimetype="image"></inline-graphic>
</inline-formula>
with other
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e235.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. If we write
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e236.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as a column vector and view it as a function on the graph – i.e. a function that assigns a weight
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e237.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to each vertex
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e238.jpg" mimetype="image"></inline-graphic>
</inline-formula>
– then we can use the graph Laplacian to compute a quadratic form on
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e239.jpg" mimetype="image"></inline-graphic>
</inline-formula>
that satisfies the relationship
<xref rid="pcbi.1000761-Chung1" ref-type="bibr">[30]</xref>
:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e240.jpg" mimetype="image" position="float"></graphic>
<label>(6)</label>
</disp-formula>
Equation (6) shows that this quadratic form measures the
<italic>smoothness</italic>
of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e241.jpg" mimetype="image"></inline-graphic>
</inline-formula>
with respect to the graph: the quadratic form is small when the function's values vary smoothly over adjacent nodes, so that the weights for sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e242.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers are close in value. Therefore, the second constraint that we impose is precisely on the size of the quadratic form, enforcing smoothness on the weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e243.jpg" mimetype="image"></inline-graphic>
</inline-formula>
:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e244.jpg" mimetype="image" position="float"></graphic>
<label>(7)</label>
</disp-formula>
</p>
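<p>The graph construction and the smoothness measure described above can be sketched in a few lines of Python. The sketch assumes the standard unweighted Laplacian (vertex degree on the diagonal, -1 for each edge) and checks numerically that the quadratic form equals the sum of squared weight differences over edges. The function names, the example sequences and the threshold value are hypothetical.</p>
<preformat>
```python
from itertools import combinations

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(kmer):
    return "".join(COMPLEMENT[base] for base in reversed(kmer))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def strand_aware_distance(a, b):
    """Minimum of the Hamming distances to b and to its reverse complement."""
    return min(hamming(a, b), hamming(a, reverse_complement(b)))

def build_edges(kmers, threshold):
    """Place an edge when the strand-aware distance is below the threshold."""
    return [(i, j) for i, j in combinations(range(len(kmers)), 2)
            if threshold > strand_aware_distance(kmers[i], kmers[j])]

def laplacian(n, edges):
    """Unweighted graph Laplacian: degree on the diagonal, -1 per edge."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def quadratic_form(L, w):
    n = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(n) for j in range(n))

def edge_smoothness(edges, w):
    """Sum of squared weight differences across edges."""
    return sum((w[i] - w[j]) ** 2 for i, j in edges)

kmers = ["ACGTAA", "ACGTAT", "TTACGT", "GGGGGG"]
edges = build_edges(kmers, threshold=2)
L = laplacian(len(kmers), edges)
w = [1, 2, 4, 0]
print(edges, quadratic_form(L, w), edge_smoothness(edges, w))
```
</preformat>
<p>Note that TTACGT is the reverse complement of ACGTAA, so the two are joined by an edge even though their direct Hamming distance is large. A weight vector that is constant on a connected group of sequence-similar sequences contributes nothing to the quadratic form.</p>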
<p>A pseudocode description of the graph-regularized PLS algorithm is given in
<xref ref-type="fig" rid="pcbi-1000761-g007">Figure 7</xref>
.</p>
<fig id="pcbi-1000761-g007" position="float">
<object-id pub-id-type="doi">10.1371/journal.pcbi.1000761.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Pseudocode for graph-regularized PLS.</title>
<p>A pseudocode description of the iterative PLS procedure, enforcing sparsity and Laplacian constraints on motif weight vectors.</p>
</caption>
<graphic xlink:href="pcbi.1000761.g007"></graphic>
</fig>
</sec>
<sec id="s4d">
<title>Filtering
<italic>k</italic>
-mer features</title>
<p>
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e245.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer features with very sparse genome-wide counts are unlikely to improve the loss function, since they occur in only a handful of promoters, and can contribute to overfitting. To eliminate
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e246.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers with infrequent counts prior to training, we filtered the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e247.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer feature set based on expected counts on background sequences. We constructed the background sequences by shuffling exon sequences 100 times and ranked
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e248.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers by the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e249.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-score
<xref rid="pcbi.1000761-Eskin1" ref-type="bibr">[31]</xref>
:
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e250.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e251.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of occurrences of the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e252.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer in all promoter sequences, N is the length of all shuffled exon sequences, and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e253.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the number of occurrences of the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e254.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer in all shuffled exon sequences divided by N. (Note that shuffled intergenic sequences could also be used to generate the random model.) We kept the top
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e255.jpg" mimetype="image"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e256.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers and built the motif matrix containing counts of
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e257.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers in promoter sequences. We found that this filtering step significantly improved cross-validation performance.</p>
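<p>The filtering step can be sketched as follows. The sketch assumes the usual binomial form of the z-score, (n - Np) / sqrt(Np(1 - p)), with n, N and p as defined above; the exact expression used in this study is the one given in the text. The pseudocount that keeps p away from zero, the function names and the toy sequences are all hypothetical additions.</p>
<preformat>
```python
import math
from collections import Counter

def count_kmers(seqs, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def z_score(n, N, p):
    """Assumed binomial z-score: observed count n against expectation N * p."""
    return (n - N * p) / math.sqrt(N * p * (1.0 - p))

def rank_kmers(promoters, background, k, keep, pseudo=1.0):
    """Return the `keep` k-mers with the highest z-scores."""
    fg = count_kmers(promoters, k)
    bg = count_kmers(background, k)
    N = sum(len(s) for s in background)  # total background length
    scores = {}
    for kmer, n in fg.items():
        # Pseudocount is an assumption of this sketch, to avoid p = 0.
        p = (bg[kmer] + pseudo) / (N + pseudo * 4 ** k)
        scores[kmer] = z_score(n, N, p)
    return sorted(scores, key=scores.get, reverse=True)[:keep]

print(rank_kmers(["ACGACGACG"], ["ACGTTTTTTT"], k=3, keep=1))
```
</preformat>
<p>In practice the background would be the pooled shuffled exon sequences, and the retained list would define the columns of the motif matrix.</p>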
</sec>
<sec id="s4e">
<title>Hierarchical sequence agglomeration</title>
<p>For each latent factor
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e258.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we rank
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e259.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers by their components in the corresponding weight vector
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e260.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and perform motif analysis on the top 50
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e261.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers. Those
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e262.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers are first displayed in the form of a motif graph via Cytoscape
<xref rid="pcbi.1000761-Shannon1" ref-type="bibr">[32]</xref>
, in which an edge between two
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e263.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer nodes indicates similarity. We used the MCODE Cytoscape Plugin
<xref rid="pcbi.1000761-Bader1" ref-type="bibr">[9]</xref>
to find
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e264.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer clusters (highly interconnected sets of sequence-similar
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e265.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers) in the graph. Each
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e266.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer cluster represents a motif pattern consisting of slightly different
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e267.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers.</p>
<p>Finally, we apply a hierarchical sequence agglomeration algorithm to generate position-specific scoring matrices (PSSMs) for
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e268.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer clusters. Within each
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e269.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer cluster, each
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e270.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer is treated as a seed PSSM (using background nucleotide probabilities for smoothing), and then the algorithm iteratively merges similar PSSMs until a single PSSM is learned as the binding site model.</p>
<p>A position-specific scoring matrix (PSSM) is represented by a probability distribution
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e271.jpg" mimetype="image"></inline-graphic>
</inline-formula>
over sequences
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e272.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e273.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The emission probabilities are assumed to be independent at every position such that
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e274.jpg" mimetype="image"></inline-graphic>
</inline-formula>
.</p>
<p>When comparing two PSSMs
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e275.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e276.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we allow offsets between their starting positions. We pad either the left or right ends with the background distribution and then define a distance measure
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e277.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as the minimum, over all possible position offsets, of the Jensen-Shannon (JS) entropy:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e278.jpg" mimetype="image" position="float"></graphic>
<label>(8)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e279.jpg" mimetype="image"></inline-graphic>
</inline-formula>
is the Kullback-Leibler divergence. Given that the position-specific probabilities are independent, one can easily show that
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e280.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. The relative weights of the two PSSMs,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e281.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e282.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, are here defined as
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e283.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, where
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e284.jpg" mimetype="image"></inline-graphic>
</inline-formula>
are the numbers of target genes for the given PSSM. The initial PSSMs are
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e285.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mers, and the number of target genes of each is the number of promoter sequences in which the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e286.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer occurs. The number of target genes of a newly merged PSSM is the combined number of target genes of the two PSSMs that were merged.</p>
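<p>A minimal Python sketch of the offset-minimized distance between two PSSMs is given below. It assumes a uniform background distribution, a weighted Jensen-Shannon form in which each PSSM is compared with the weighted mixture of the two, and relative weights proportional to the numbers of target genes. These choices, the smoothing constant and all function names are assumptions made for illustration; Equation (8) gives the definition actually used.</p>
<preformat>
```python
import math

BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}  # assumption: uniform background

def seed_pssm(kmer, pseudo=0.1):
    """Turn a k-mer into a PSSM, smoothing each column with the background."""
    return [{b: (1.0 - pseudo) * (b == x) + pseudo * BACKGROUND[b] for b in BASES}
            for x in kmer]

def kl(p, q):
    """Kullback-Leibler divergence between two column distributions."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in BASES if p[b] > 0)

def js_column(p, q, w1, w2):
    """Weighted Jensen-Shannon term for one aligned column."""
    m = {b: w1 * p[b] + w2 * q[b] for b in BASES}
    return w1 * kl(p, m) + w2 * kl(q, m)

def aligned_js(P, Q, offset, w1, w2):
    """Sum column terms with Q starting `offset` positions after P."""
    left_p, left_q = max(0, -offset), max(0, offset)
    width = max(left_p + len(P), left_q + len(Q))
    pad_p = [BACKGROUND] * left_p + P + [BACKGROUND] * (width - left_p - len(P))
    pad_q = [BACKGROUND] * left_q + Q + [BACKGROUND] * (width - left_q - len(Q))
    return sum(js_column(a, b, w1, w2) for a, b in zip(pad_p, pad_q))

def pssm_distance(P, Q, n1, n2):
    """Minimum over offsets; n1 and n2 are the numbers of target genes."""
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    offsets = range(-(len(Q) - 1), len(P))  # at least one overlapping column
    return min(aligned_js(P, Q, o, w1, w2) for o in offsets)

P, Q = seed_pssm("ACGT"), seed_pssm("CGTA")
print(pssm_distance(P, Q, 1, 1), aligned_js(P, Q, 0, 0.5, 0.5))
```
</preformat>
<p>Because positions are treated as independent, the distance for a given offset is a sum of per-column terms. In the example, shifting the second PSSM by one position aligns the shared CGT core and gives a much smaller distance than the unshifted comparison.</p>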
</sec>
<sec id="s4f">
<title>Assigning genes to latent factors</title>
<p>To extract biological information from the algorithm output, we analyzed latent factors for potential gene groups and corresponding biological functions. To do that, we assigned each gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e287.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to the gene group associated with a factor
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e288.jpg" mimetype="image"></inline-graphic>
</inline-formula>
based on
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e289.jpg" mimetype="image"></inline-graphic>
</inline-formula>
values. Here, the matrix
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e290.jpg" mimetype="image"></inline-graphic>
</inline-formula>
(respectively,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e291.jpg" mimetype="image"></inline-graphic>
</inline-formula>
) is formed by placing vectors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e292.jpg" mimetype="image"></inline-graphic>
</inline-formula>
(respectively,
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e293.jpg" mimetype="image"></inline-graphic>
</inline-formula>
) for latent factors
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e294.jpg" mimetype="image"></inline-graphic>
</inline-formula>
as column vectors (
<xref ref-type="fig" rid="pcbi-1000761-g001">Figure 1</xref>
). The value
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e295.jpg" mimetype="image"></inline-graphic>
</inline-formula>
indicates how well
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e296.jpg" mimetype="image"></inline-graphic>
</inline-formula>
captures the
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e297.jpg" mimetype="image"></inline-graphic>
</inline-formula>
-mer profile of gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e298.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, and the value
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e299.jpg" mimetype="image"></inline-graphic>
</inline-formula>
measures the similarity between
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e300.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and the expression profile of gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e301.jpg" mimetype="image"></inline-graphic>
</inline-formula>
. In contrast to traditional clustering, which only relies on gene expression to group genes, we integrate both sequence and gene expression information in learning potentially functional gene sets. For each gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e302.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we computed
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e303.jpg" mimetype="image"></inline-graphic>
</inline-formula>
across all factors and chose factor
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e304.jpg" mimetype="image"></inline-graphic>
</inline-formula>
with the maximum value:
<disp-formula>
<graphic xlink:href="pcbi.1000761.e305.jpg" mimetype="image" position="float"></graphic>
</disp-formula>
</p>
<p>Since we suspected that only large
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e306.jpg" mimetype="image"></inline-graphic>
</inline-formula>
values indicated strong association of a gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e307.jpg" mimetype="image"></inline-graphic>
</inline-formula>
with factor
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e308.jpg" mimetype="image"></inline-graphic>
</inline-formula>
, we assigned gene
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e309.jpg" mimetype="image"></inline-graphic>
</inline-formula>
to factor
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e310.jpg" mimetype="image"></inline-graphic>
</inline-formula>
only when
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e311.jpg" mimetype="image"></inline-graphic>
</inline-formula>
was in the top 20% of all
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e312.jpg" mimetype="image"></inline-graphic>
</inline-formula>
values. Although we use
<inline-formula>
<inline-graphic xlink:href="pcbi.1000761.e313.jpg" mimetype="image"></inline-graphic>
</inline-formula>
latent factors in our model, here we compute the representation with five factors, reasoning that if a gene is assigned to the 5th factor, it should not be included in our main analysis.</p>
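<p>The assignment rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name assign_genes_to_factors, the dictionary input and the parameter names are hypothetical; the per-gene, per-factor scores are assumed to have been computed beforehand; and the top 20% cutoff is assumed to be taken over the pooled scores of all genes and all factors.</p>
<preformat>
```python
from typing import Dict, List, Optional


def assign_genes_to_factors(
    scores: Dict[str, List[float]],
    top_fraction: float = 0.2,
    excluded_factor: int = 5,
) -> Dict[str, Optional[int]]:
    """Map each gene to its best-scoring factor (1-based), or to None.

    scores[gene][i] is the (precomputed) score of the gene for factor i + 1.
    """
    # Step 1: for every gene, pick the factor with the maximum score.
    best = {}
    for gene, row in scores.items():
        idx = max(range(len(row)), key=lambda i: row[i])
        best[gene] = (idx + 1, row[idx])

    # Step 2: find the cutoff for the top `top_fraction` of scores.
    # ASSUMPTION: the ranking pools every gene-factor score.
    pooled = sorted((v for row in scores.values() for v in row), reverse=True)
    n_top = max(1, int(len(pooled) * top_fraction))
    cutoff = pooled[n_top - 1]

    # Step 3: keep a gene only if its best score reaches the cutoff and
    # its best factor is not the excluded (here, fifth) factor.
    assignment: Dict[str, Optional[int]] = {}
    for gene, (factor, value) in best.items():
        keep = value >= cutoff and factor != excluded_factor
        assignment[gene] = factor if keep else None
    return assignment


if __name__ == "__main__":
    toy_scores = {
        "geneA": [0.9, 0.6, 0.0, 0.0, 0.0],
        "geneB": [0.1, 0.8, 0.0, 0.0, 0.0],
        "geneC": [0.25, 0.1, 0.1, 0.1, 0.05],
        "geneD": [0.0, 0.0, 0.0, 0.0, 0.95],
    }
    print(assign_genes_to_factors(toy_scores))
```
</preformat>
<p>In this toy input, a gene whose best score falls below the pooled cutoff, or whose best factor is the fifth one, is left unassigned. Pooling the scores is one possible reading of the threshold; ranking only the per-gene maxima would be an equally simple variant.</p>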
</sec>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pcbi.1000761.s001">
<label>Figure S1</label>
<caption>
<p>Correspondence between the first and second latent factors and oocyte and sperm genes, respectively. (A,B) The set of all genes is split into oocyte and non-oocyte genes, or sperm and non-sperm genes, and the empirical cumulative distribution of correlation with c
<sub>i</sub>
, i = 1,2 is plotted. Oocyte and sperm genes are enriched towards the top of the gene expression correlation distribution. (C,D) The set of all genes is split into oocyte and non-oocyte genes, or sperm and non-sperm genes, and the corresponding empirical cumulative distributions of hits of the top 50
<italic>k</italic>
-mers in w
<sub>i</sub>
, i = 1,2 are plotted. Oocyte and sperm genes are enriched in
<italic>k</italic>
-mer hits corresponding to the 1st and 2nd weight vectors.</p>
<p>(3.03 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s001.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s002">
<label>Figure S2</label>
<caption>
<p>Correlation of weights with significance of enrichment in oocyte and sperm genes for the
<italic>k</italic>
-mers from the 1st and 2nd graph-mers, respectively. We plot the weights of
<italic>k</italic>
-mers in the first and second motif weight vectors versus the −log
<sub>10</sub>
(p-value) for the enrichment of these
<italic>k</italic>
-mers in oocyte and sperm genes, respectively, as computed from the hypergeometric distribution. (A) For oocyte genes, −log
<sub>10</sub>
(p-value) is moderately correlated with w
<sub>1</sub>
(Pearson coefficient = 0.65), and
<italic>k</italic>
-mers highly ranked by w
<sub>1</sub>
had p-values between 10
<sup>−16</sup>
and 10
<sup>−4</sup>
. This enrichment supports the functional relevance of PLS-derived
<italic>k</italic>
-mers from the first factor in oocyte genes. (B) For sperm genes, −log
<sub>10</sub>
(p-value) is somewhat correlated with w
<sub>2</sub>
(Pearson coefficient = 0.35), though the correlation is weaker than for oocyte genes.</p>
<p>(0.42 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s002.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s003">
<label>Figure S3</label>
<caption>
<p>Normalized mean squared prediction error on cross-validation test data. (A) Normalized mean squared error versus number of PLS iterations for standard univariate and multivariate PLS. At each iteration, standard univariate PLS learns twelve latent factors, corresponding to the twelve individual time points, while multivariate PLS learns one latent factor for all time points. Univariate PLS yielded a slightly lower test error than standard multivariate PLS after the 1st iteration; however, after one iteration, univariate PLS corresponds to a collection of motif sets, each predicting a single experiment's gene expression changes, while multivariate PLS uses a single motif set to predict full gene expression trajectories. (B) Normalized mean squared error on test data by time point after the 1st univariate PLS iteration, shown for all genes and for the oocyte and sperm gene sets. Univariate PLS reaches its lowest prediction error on the oocyte gene set at late time points, when oocyte gene expression peaks. Similarly, prediction error on the sperm gene set is small at middle time points, when sperm gene expression peaks. Each time-specific univariate PLS model captures the motif-expression correspondence for the gene set differentially expressed at the given time point.</p>
<p>(0.40 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s003.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s004">
<label>Figure S4</label>
<caption>
<p>Latent factor analysis reveals graph-mers, expression patterns and significant associations of gene annotations. For each latent factor (i = 1…3), an associated mini graph-mer, extracted motif patterns and gene group are shown; annotations that are significantly enriched in each gene group are listed at the right (p &lt; 0.001, uncorrected hypergeometric p-value), with p-values and number of genes associated with each annotation.</p>
<p>(2.02 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s004.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s005">
<label>Figure S5</label>
<caption>
<p>Motifs found by AlignACE in genes correlated with PC
<sub>1</sub>
and PC
<sub>2</sub>
. (A) Top 40 AlignACE motifs in genes correlated with PC
<sub>1</sub>
sorted by MAP score. Top-ranked AA-rich and GG-rich motifs may result from low-complexity regions, and several PCA motifs with relatively low MAP scores (e.g. MAP = 147.05, 90.77, 80.93) are similar to PLS 1st factor motifs. (B) Top 40 AlignACE motifs in genes correlated with PC
<sub>2</sub>
. Only one motif (MAP score = 101.03) is similar to our PLS sperm gene motif ACGTG from the 2nd weight vector. None of the other PCA motifs matched any of the PLS 2nd factor motifs.</p>
<p>(7.58 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s005.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s006">
<label>Figure S6</label>
<caption>
<p>Motifs found by AlignACE in different gene clusters. (A) Expression patterns of genes in Cluster 1. (B) Expression patterns of genes in Cluster 2. (C) Expression patterns of genes in Cluster 3. (D) Top 40 AlignACE motifs found in Cluster 1 genes. (E) Top 40 AlignACE motifs found in Cluster 2 genes. (F) All 35 AlignACE motifs found in Cluster 3 genes.</p>
<p>(10.87 MB TIF)</p>
</caption>
<media xlink:href="pcbi.1000761.s006.tif">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1000761.s007">
<label>Text S1</label>
<caption>
<p>Supplementary results</p>
<p>(0.08 MB PDF)</p>
</caption>
<media xlink:href="pcbi.1000761.s007.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="COI-statement">
<p>The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="financial-disclosure">
<p>We would like to acknowledge the funding from NSF grant IIS-0705580 and NIH NCBC grant U54-CA121852 to Columbia University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="pcbi.1000761-Tompa1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>BD</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites.</article-title>
<source>Nat Biotechnol</source>
<volume>23</volume>
<fpage>137</fpage>
<lpage>144</lpage>
<pub-id pub-id-type="pmid">15637633</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Tibshirani1">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>1996</year>
<article-title>Regression shrinkage and selection via the lasso.</article-title>
<source>J R Stat Soc Series B</source>
<volume>58</volume>
<fpage>267</fpage>
<lpage>288</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Belkin1">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Belkin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Niyogi</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Sindhwani</surname>
<given-names>V</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Manifold regularization: A geometric framework for learning from labeled and unlabeled examples.</article-title>
<source>JMLR</source>
<volume>7</volume>
<fpage>2399</fpage>
<lpage>2434</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Ng1">
<label>4</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>AY</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>On spectral clustering: Analysis and an algorithm.</article-title>
<source>Advances in Neural Information Processing Systems 14</source>
<publisher-name>MIT Press</publisher-name>
<fpage>849</fpage>
<lpage>856</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Rapaport1">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rapaport</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Zinovyev</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Dutreix</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Barillot</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Vert</surname>
<given-names>JP</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Classification of microarray data using gene networks.</article-title>
<source>BMC Bioinformatics</source>
<volume>8</volume>
</element-citation>
</ref>
<ref id="pcbi.1000761-Bailey1">
<label>6</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Elkan</surname>
<given-names>C</given-names>
</name>
</person-group>
<year>1994</year>
<article-title>Fitting a mixture model by expectation maximization to discover motifs in biopolymers.</article-title>
<source>Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology</source>
<publisher-name>AAAI Press</publisher-name>
<fpage>28</fpage>
<lpage>36</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Reinke1">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reinke</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Gil</surname>
<given-names>IS</given-names>
</name>
<name>
<surname>Ward</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kazmer</surname>
<given-names>K</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Genome-wide germline-enriched and sex-biased expression profiles in Caenorhabditis elegans.</article-title>
<source>Development</source>
<volume>131</volume>
<fpage>311</fpage>
<lpage>323</lpage>
<pub-id pub-id-type="pmid">14668411</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Boulesteix1">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boulesteix</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Strimmer</surname>
<given-names>K</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Partial least squares: a versatile tool for the analysis of high-dimensional genomic data.</article-title>
<source>Brief Bioinform</source>
<volume>8</volume>
<fpage>32</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="pmid">16772269</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Bader1">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bader</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Hogue</surname>
<given-names>CWV</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>An automated method for finding molecular complexes in large protein interaction networks.</article-title>
<source>BMC Bioinformatics</source>
<volume>4</volume>
<fpage>2</fpage>
<pub-id pub-id-type="pmid">12525261</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Shim1">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shim</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>elt-1, a gene encoding a Caenorhabditis elegans GATA transcription factor, is highly expressed in the germ lines with msp genes as the potential targets.</article-title>
<source>Mol Cells</source>
<volume>9</volume>
<fpage>535</fpage>
<lpage>541</lpage>
<pub-id pub-id-type="pmid">10597043</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-J1">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chaudhary</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Skinner</surname>
<given-names>MK</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Basic helix-loop-helix proteins can act at the E-box within the serum response element of the c-fos promoter to influence hormone-induced promoter activation in Sertoli cells.</article-title>
<source>Mol Endocrinol</source>
<volume>13</volume>
<fpage>774</fpage>
<lpage>786</lpage>
<pub-id pub-id-type="pmid">10319327</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Waterston1">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Waterston</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lindblad-Toh</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Rogers</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Abril</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>Initial sequencing and comparative analysis of the mouse genome.</article-title>
<source>Nature</source>
<volume>420</volume>
<fpage>520</fpage>
<lpage>562</lpage>
<pub-id pub-id-type="pmid">12466850</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Raychaudhuri1">
<label>13</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Raychaudhuri</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stuart</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Principal components analysis to summarize microarray experiments: application to sporulation time series.</article-title>
<source>Pac Symp Biocomput</source>
<volume>5</volume>
<fpage>455</fpage>
<lpage>466</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Beer1">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beer</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Tavazoie</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Predicting gene expression from sequence.</article-title>
<source>Cell</source>
<volume>117</volume>
<fpage>185</fpage>
<lpage>198</lpage>
<pub-id pub-id-type="pmid">15084257</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Ernst1">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ernst</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vainas</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Harbison</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bar-Joseph</surname>
<given-names>Z</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Reconstructing dynamic regulatory maps.</article-title>
<source>Mol Syst Biol</source>
<volume>3</volume>
<fpage>74</fpage>
<pub-id pub-id-type="pmid">17224918</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Segal1">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Shapira</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Regev</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pe'er</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Botstein</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Module networks: Identifying regulatory modules and their condition specific regulators from gene expression data.</article-title>
<source>Nat Genet</source>
<volume>34</volume>
<fpage>166</fpage>
<lpage>176</lpage>
<pub-id pub-id-type="pmid">12740579</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Middendorf1">
<label>17</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Middendorf</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kundaje</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Freund</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wiggins</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Motif discovery through predictive modeling of gene regulation.</article-title>
<person-group person-group-type="editor">
<name>
<surname>Miyano</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mesirov</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<source>RECOMB</source>
<publisher-loc>Cambridge, MA</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>538</fpage>
<lpage>552</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Kundaje1">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kundaje</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Xin</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Lan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lianoglou</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<year>2008</year>
<article-title>A predictive model of the oxygen and heme regulatory network in yeast.</article-title>
<source>PLoS Comput Biol</source>
<volume>4</volume>
</element-citation>
</ref>
<ref id="pcbi.1000761-Bussemaker1">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bussemaker</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Siggia</surname>
<given-names>ED</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Regulatory element detection using correlation with expression.</article-title>
<source>Nat Genet</source>
<volume>27</volume>
<fpage>167</fpage>
<lpage>171</lpage>
<pub-id pub-id-type="pmid">11175784</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Zhang1">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>NR</given-names>
</name>
<name>
<surname>Wildermuth</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Speed</surname>
<given-names>TP</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Transcription factor binding site prediction with multivariate gene expression data.</article-title>
<source>Ann Appl Stat</source>
<volume>2</volume>
<fpage>332</fpage>
<lpage>365</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Bonneau1">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bonneau</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Reiss</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Shannon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Facciotti</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hood</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo.</article-title>
<source>Genome Biol</source>
<volume>7</volume>
<fpage>R36</fpage>
<pub-id pub-id-type="pmid">16686963</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Brilli1">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brilli</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fani</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lió</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>MotifScorer: using a compendium of microarrays to identify regulatory motifs.</article-title>
<source>Bioinformatics</source>
<volume>23</volume>
<fpage>493</fpage>
<lpage>495</lpage>
<pub-id pub-id-type="pmid">17138590</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Naughton1">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Naughton</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Fratkin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites.</article-title>
<source>Nucleic Acids Res</source>
<volume>34</volume>
<fpage>5730</fpage>
<lpage>5739</lpage>
<pub-id pub-id-type="pmid">17041233</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Das1">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Das</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pellegrini</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gray</surname>
<given-names>JW</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>A primer on regression methods for decoding cis-regulatory logic.</article-title>
<source>PLoS Comput Biol</source>
<volume>5</volume>
<fpage>e1000269</fpage>
<pub-id pub-id-type="pmid">19180174</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Segal2">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segal</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Raveh-Sadka</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Schroeder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Unnerstall</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Gaul</surname>
<given-names>U</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Predicting expression patterns from regulatory sequence in Drosophila segmentation.</article-title>
<source>Nature</source>
<volume>451</volume>
<fpage>535</fpage>
<lpage>540</lpage>
<pub-id pub-id-type="pmid">18172436</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Wang1">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Group SCAD regression analysis for microarray time course gene expression data.</article-title>
<source>Bioinformatics</source>
<volume>23</volume>
<fpage>1486</fpage>
<lpage>1494</lpage>
<pub-id pub-id-type="pmid">17463025</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Hill1">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hill</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Tsung</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tucker-Kellogg</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Genomic analysis of gene expression in C. elegans.</article-title>
<source>Science</source>
<volume>290</volume>
<fpage>809</fpage>
<lpage>812</lpage>
<pub-id pub-id-type="pmid">11052945</pub-id>
</element-citation>
</ref>
<ref id="pcbi.1000761-Jong1">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Jong</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>1993</year>
<article-title>SIMPLS: An alternative approach to partial least squares regression.</article-title>
<source>Chemom Intell Lab Syst</source>
<volume>18</volume>
<fpage>251</fpage>
<lpage>263</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Weinberger1">
<label>29</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Weinberger</surname>
<given-names>KQ</given-names>
</name>
<name>
<surname>Sha</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Saul</surname>
<given-names>LK</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Graph Laplacian regularization for large-scale semidefinite programming.</article-title>
<person-group person-group-type="editor">
<name>
<surname>Schölkopf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Platt</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>T</given-names>
</name>
</person-group>
<source>Advances in Neural Information Processing Systems 19</source>
<publisher-loc>Cambridge, MA</publisher-loc>
<publisher-name>MIT Press</publisher-name>
<fpage>1489</fpage>
<lpage>1496</lpage>
</element-citation>
</ref>
<ref id="pcbi.1000761-Chung1">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>FRK</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92).</article-title>
<source>American Mathematical Society</source>
</element-citation>
</ref>
<ref id="pcbi.1000761-Eskin1">
<label>31</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gelfand</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Genome wide analysis of bacterial promoter regions.</article-title>
<source>Biocomputing 2003: Proceedings of the Pacific Symposium Hawaii, USA 3–7 January 2003</source>
<publisher-name>World Scientific Pub Co Inc</publisher-name>
<size units="page">29</size>
</element-citation>
</ref>
<ref id="pcbi.1000761-Shannon1">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shannon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Markiel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ozier</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Baliga</surname>
<given-names>NS</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>JT</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Cytoscape: a software environment for integrated models of biomolecular interaction networks.</article-title>
<source>Genome Res</source>
<volume>13</volume>
<fpage>2498</fpage>
<lpage>2504</lpage>
<pub-id pub-id-type="pmid">14597658</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F91 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F91 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2861633
   |texte=   Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:20454681" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021