TelematiV1, Pmc, Corpus, bibRecord, 0003249

***** Acces problem to record *****\

Identifieur interne : 0003249 ( Pmc/Corpus ); précédent : 0003248; suivant : 0003250 ***** probable Xml problem with record *****

Links to Exploration step

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Direct vs 2-stage approaches to structured motif finding</title>
<author><name sortKey="Federico, Maria" sort="Federico, Maria" uniqKey="Federico M" first="Maria" last="Federico">Maria Federico</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Leoncini, Mauro" sort="Leoncini, Mauro" uniqKey="Leoncini M" first="Mauro" last="Leoncini">Mauro Leoncini</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I2">Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, 56124 Pisa, Via Moruzzi, 1, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Montangero, Manuela" sort="Montangero, Manuela" uniqKey="Montangero M" first="Manuela" last="Montangero">Manuela Montangero</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I2">Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, 56124 Pisa, Via Moruzzi, 1, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Valente, Paolo" sort="Valente, Paolo" uniqKey="Valente P" first="Paolo" last="Valente">Paolo Valente</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">22908910</idno>
<idno type="pmc">3564690</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564690</idno>
<idno type="RBID">PMC:3564690</idno>
<idno type="doi">10.1186/1748-7188-7-20</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000324</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000324</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Direct vs 2-stage approaches to structured motif finding</title>
<author><name sortKey="Federico, Maria" sort="Federico, Maria" uniqKey="Federico M" first="Maria" last="Federico">Maria Federico</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Leoncini, Mauro" sort="Leoncini, Mauro" uniqKey="Leoncini M" first="Mauro" last="Leoncini">Mauro Leoncini</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I2">Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, 56124 Pisa, Via Moruzzi, 1, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Montangero, Manuela" sort="Montangero, Manuela" uniqKey="Montangero M" first="Manuela" last="Montangero">Manuela Montangero</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="I2">Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, 56124 Pisa, Via Moruzzi, 1, Italy</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Valente, Paolo" sort="Valente, Paolo" uniqKey="Valente P" first="Paolo" last="Valente">Paolo Valente</name>
<affiliation><nlm:aff id="I1">Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Algorithms for Molecular Biology : AMB</title>
<idno type="eISSN">1748-7188</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>The notion of <italic>DNA motif</italic>
 is a mathematical abstraction used to model regions of the DNA (known as <italic>Transcription Factor Binding Sites</italic>
, or <italic>TFBSs</italic>
) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, <italic>DNA structured motifs</italic>
 are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulations processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or <italic>simple</italic>
) motifs, separated by a variable, but somewhat constrained number of “irrelevant” base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs.</p>
</sec>
<sec><title>Results</title>
<p>We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach w.r.t. direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, exhibiting a good performance also in most of the cases in which it was inferior.</p>
</sec>
<sec><title>Conclusions</title>
<p>A reflection on the results obtained lead us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting the computations when the first halts.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Watson, Jd" uniqKey="Watson J">JD Watson</name>
</author>
<author><name sortKey="Baker, Ta" uniqKey="Baker T">TA Baker</name>
</author>
<author><name sortKey="Bell, Sp" uniqKey="Bell S">SP Bell</name>
</author>
<author><name sortKey="Gann, A" uniqKey="Gann A">A Gann</name>
</author>
<author><name sortKey="Levine, M" uniqKey="Levine M">M Levine</name>
</author>
<author><name sortKey="Losick, R" uniqKey="Losick R">R Losick</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Werner, T" uniqKey="Werner T">T Werner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sinha, S" uniqKey="Sinha S">S Sinha</name>
</author>
<author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lemon, B" uniqKey="Lemon B">B Lemon</name>
</author>
<author><name sortKey="Tjian, R" uniqKey="Tjian R">R Tjian</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wray, Ga" uniqKey="Wray G">GA Wray</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author><name sortKey="Elkan, C" uniqKey="Elkan C">C Elkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author><name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author><name sortKey="Wang, L" uniqKey="Wang L">L Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lawrence, Ce" uniqKey="Lawrence C">CE Lawrence</name>
</author>
<author><name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author><name sortKey="Boguski, Ms" uniqKey="Boguski M">MS Boguski</name>
</author>
<author><name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
<author><name sortKey="Neuwald, Af" uniqKey="Neuwald A">AF Neuwald</name>
</author>
<author><name sortKey="Wootton, Jc" uniqKey="Wootton J">JC Wootton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Brazma, A" uniqKey="Brazma A">A Brazma</name>
</author>
<author><name sortKey="Jonassen, I" uniqKey="Jonassen I">I Jonassen</name>
</author>
<author><name sortKey="Eidhammer, I" uniqKey="Eidhammer I">I Eidhammer</name>
</author>
<author><name sortKey="Gilbert, D" uniqKey="Gilbert D">D Gilbert</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Van, Helden" uniqKey="Van H">Helden van</name>
</author>
<author><name sortKey="Andre, B" uniqKey="Andre B">B André</name>
</author>
<author><name sortKey="Collado Vides, J" uniqKey="Collado Vides J">J Collado-Vides</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author><name sortKey="Sze, Sh" uniqKey="Sze S">SH Sze</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Guha Thakurta, D" uniqKey="Guha Thakurta D">D Guha-Thakurta</name>
</author>
<author><name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pavesi, G" uniqKey="Pavesi G">G Pavesi</name>
</author>
<author><name sortKey="Mauri, G" uniqKey="Mauri G">G Mauri</name>
</author>
<author><name sortKey="Pesole, G" uniqKey="Pesole G">G Pesole</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author><name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sinha, S" uniqKey="Sinha S">S Sinha</name>
</author>
<author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
<author><name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Favorov, Av" uniqKey="Favorov A">AV Favorov</name>
</author>
<author><name sortKey="Gelfand, Ms" uniqKey="Gelfand M">MS Gelfand</name>
</author>
<author><name sortKey="Gerasimova, Av" uniqKey="Gerasimova A">AV Gerasimova</name>
</author>
<author><name sortKey="Ravcheev, Da" uniqKey="Ravcheev D">DA Ravcheev</name>
</author>
<author><name sortKey="Mironov, Aa" uniqKey="Mironov A">AA Mironov</name>
</author>
<author><name sortKey="Makeev, Vj" uniqKey="Makeev V">VJ Makeev</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mendes, N" uniqKey="Mendes N">N Mendes</name>
</author>
<author><name sortKey="Casimiro, A" uniqKey="Casimiro A">A Casimiro</name>
</author>
<author><name sortKey="Santos, P" uniqKey="Santos P">P Santos</name>
</author>
<author><name sortKey="Sa Correia, I" uniqKey="Sa Correia I">I Sá-Correia</name>
</author>
<author><name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
<author><name sortKey="Freitas, A" uniqKey="Freitas A">A Freitas</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="D Aeseleer, P" uniqKey="D Aeseleer P">P D’haeseleer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Das, Mk" uniqKey="Das M">MK Das</name>
</author>
<author><name sortKey="Dai, Hk" uniqKey="Dai H">HK Dai</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
<author><name sortKey="Hartzell, Gw" uniqKey="Hartzell G">GW Hartzell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wolfertstetter, F" uniqKey="Wolfertstetter F">F Wolfertstetter</name>
</author>
<author><name sortKey="Frech, K" uniqKey="Frech K">K Frech</name>
</author>
<author><name sortKey="Herrmann, G" uniqKey="Herrmann G">G Herrmann</name>
</author>
<author><name sortKey="Werner, T" uniqKey="Werner T">T Werner</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Linhart, C" uniqKey="Linhart C">C Linhart</name>
</author>
<author><name sortKey="Halperin, Y" uniqKey="Halperin Y">Y Halperin</name>
</author>
<author><name sortKey="Shamir, R" uniqKey="Shamir R">R Shamir</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author><name sortKey="Zaki, Mj" uniqKey="Zaki M">MJ Zaki</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pisanti, N" uniqKey="Pisanti N">N Pisanti</name>
</author>
<author><name sortKey="Carvalho, A" uniqKey="Carvalho A">A Carvalho</name>
</author>
<author><name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhou, J" uniqKey="Zhou J">J Zhou</name>
</author>
<author><name sortKey="Sander, J" uniqKey="Sander J">J Sander</name>
</author>
<author><name sortKey="Lin, G" uniqKey="Lin G">G Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
<author><name sortKey="Li, N" uniqKey="Li N">N Li</name>
</author>
<author><name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author><name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mccreight, Em" uniqKey="Mccreight E">EM McCreight</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gusfield, D" uniqKey="Gusfield D">D Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Carvalho, A" uniqKey="Carvalho A">A Carvalho</name>
</author>
<author><name sortKey="Freitas, A" uniqKey="Freitas A">A Freitas</name>
</author>
<author><name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Allali, J" uniqKey="Allali J">J Allali</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Carvalho, A" uniqKey="Carvalho A">A Carvalho</name>
</author>
<author><name sortKey="Freitas, A" uniqKey="Freitas A">A Freitas</name>
</author>
<author><name sortKey="Oliveira, A" uniqKey="Oliveira A">A Oliveira</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Leung, Cm" uniqKey="Leung C">CM Leung</name>
</author>
<author><name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Buhler, J" uniqKey="Buhler J">J Buhler</name>
</author>
<author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
<author><name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author><name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Federico, M" uniqKey="Federico M">M Federico</name>
</author>
<author><name sortKey="Valente, P" uniqKey="Valente P">P Valente</name>
</author>
<author><name sortKey="Leoncini, M" uniqKey="Leoncini M">M Leoncini</name>
</author>
<author><name sortKey="Montangero, M" uniqKey="Montangero M">M Montangero</name>
</author>
<author><name sortKey="Cavicchioli, R" uniqKey="Cavicchioli R">R Cavicchioli</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhu, J" uniqKey="Zhu J">J Zhu</name>
</author>
<author><name sortKey="Zhang, M" uniqKey="Zhang M">M Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Teixeira, Mc" uniqKey="Teixeira M">MC Teixeira</name>
</author>
<author><name sortKey="Monteiro, P" uniqKey="Monteiro P">P Monteiro</name>
</author>
<author><name sortKey="Jain, P" uniqKey="Jain P">P Jain</name>
</author>
<author><name sortKey="Tenreiro, S" uniqKey="Tenreiro S">S Tenreiro</name>
</author>
<author><name sortKey="Fernandes, Ar" uniqKey="Fernandes A">AR Fernandes</name>
</author>
<author><name sortKey="Mira, Np" uniqKey="Mira N">NP Mira</name>
</author>
<author><name sortKey="Alenquer, M" uniqKey="Alenquer M">M Alenquer</name>
</author>
<author><name sortKey="Freitas, At" uniqKey="Freitas A">AT Freitas</name>
</author>
<author><name sortKey="Oliveira, Al" uniqKey="Oliveira A">AL Oliveira</name>
</author>
<author><name sortKey="Sa Correia, I" uniqKey="Sa Correia I">I Sá-Correia</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Thomas Chollier, M" uniqKey="Thomas Chollier M">M Thomas-Chollier</name>
</author>
<author><name sortKey="Sand, O" uniqKey="Sand O">O Sand</name>
</author>
<author><name sortKey="Turatsinze, Jv" uniqKey="Turatsinze J">JV Turatsinze</name>
</author>
<author><name sortKey="Janky, R" uniqKey="Janky R">R Janky</name>
</author>
<author><name sortKey="Defrance, M" uniqKey="Defrance M">M Defrance</name>
</author>
<author><name sortKey="Vervisch, E" uniqKey="Vervisch E">E Vervisch</name>
</author>
<author><name sortKey="Brohee, S" uniqKey="Brohee S">S Brohee</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Carvalho, Am" uniqKey="Carvalho A">AM Carvalho</name>
</author>
<author><name sortKey="Freitas, At" uniqKey="Freitas A">AT Freitas</name>
</author>
<author><name sortKey="Oliveira, Al" uniqKey="Oliveira A">AL Oliveira</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Algorithms Mol Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">Algorithms Mol Biol</journal-id>
<journal-title-group><journal-title>Algorithms for Molecular Biology : AMB</journal-title>
</journal-title-group>
<issn pub-type="epub">1748-7188</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">22908910</article-id>
<article-id pub-id-type="pmc">3564690</article-id>
<article-id pub-id-type="publisher-id">1748-7188-7-20</article-id>
<article-id pub-id-type="doi">10.1186/1748-7188-7-20</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research</subject>
</subj-group>
</article-categories>
<title-group><article-title>Direct vs 2-stage approaches to structured motif finding</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" id="A1"><name><surname>Federico</surname>
<given-names>Maria</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>maria.federico@unimore.it</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A2"><name><surname>Leoncini</surname>
<given-names>Mauro</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>leoncini@unimore.it</email>
</contrib>
<contrib contrib-type="author" id="A3"><name><surname>Montangero</surname>
<given-names>Manuela</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>manuela.montangero@unimore.it</email>
</contrib>
<contrib contrib-type="author" id="A4"><name><surname>Valente</surname>
<given-names>Paolo</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>paolo.valente@unimore.it</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy</aff>
<aff id="I2"><label>2</label>
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, 56124 Pisa, Via Moruzzi, 1, Italy</aff>
<pub-date pub-type="collection"><year>2012</year>
</pub-date>
<pub-date pub-type="epub"><day>21</day>
<month>8</month>
<year>2012</year>
</pub-date>
<volume>7</volume>
<fpage>20</fpage>
<lpage>20</lpage>
<history><date date-type="received"><day>27</day>
<month>10</month>
<year>2011</year>
</date>
<date date-type="accepted"><day>25</day>
<month>7</month>
<year>2012</year>
</date>
</history>
<permissions><copyright-statement>Copyright ©2012 Federico et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Federico et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.almob.org/content/7/1/20"></self-uri>
<abstract><sec><title>Background</title>
<p>The notion of <italic>DNA motif</italic>
 is a mathematical abstraction used to model regions of the DNA (known as <italic>Transcription Factor Binding Sites</italic>
, or <italic>TFBSs</italic>
) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, <italic>DNA structured motifs</italic>
 are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulations processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or <italic>simple</italic>
) motifs, separated by a variable, but somewhat constrained number of “irrelevant” base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs.</p>
</sec>
<sec><title>Results</title>
<p>We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach w.r.t. direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, exhibiting a good performance also in most of the cases in which it was inferior.</p>
</sec>
<sec><title>Conclusions</title>
<p>A reflection on the results obtained lead us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting the computations when the first halts.</p>
</sec>
</abstract>
<kwd-group><kwd>Structured motif</kwd>
<kwd>TFBS discovery</kwd>
<kwd>Combinatorial algorithms</kwd>
</kwd-group>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>Understanding the complex mechanisms that regulate gene expression is a pivotal problem in molecular biology. Gene transcription
[<xref ref-type="bibr" rid="B1">1</xref>
] starts when one or more regulatory proteins bind DNA regulatory elements, which are mostly located in the promoter region nearby the transcription start site (TSS) of genes, or also further apart in eukaryotic organisms (e.g. enhancers, silencers). In eukaryotes, DNA binding proteins are called transcription factors (TFs) and regulatory elements, to which they bind, are known as transcription factor binding sites (TFBSs).</p>
<p>In lower eukaryotes TFBSs are usually short DNA strings (5-25 base pairs long) bound by a single TF, that frequently appear, with possibly some mutations, upstream of the TSS in the proximal promoter region of co-regulated genes.</p>
<p>In higher eukaryotic organisms, transcription regulation is more complex and TFBSs are more difficult to characterize
[<xref ref-type="bibr" rid="B2">2</xref>]. There may be multiple binding sites for a single TF in a single gene’s promoter region; there can be great variability in the binding sites of a single TF; the regulatory elements may be located also several kilobases away from the TSS, either upstream or downstream or in the introns of the genes that they regulate
[<xref ref-type="bibr" rid="B3">3</xref>], and in this case they are often organized in functional groups (called cis-regulatory modules)
[<xref ref-type="bibr" rid="B2">2</xref>
] bound by several interacting TFs in a cooperative or antagonistic way.</p>
<p>Being able to identify TFBSs is crucial to our understanding of the mechanisms that regulate gene expression (e.g., chronology and cell-specificity of transcription
[<xref ref-type="bibr" rid="B4">4</xref>]), and of the functions of individual genes regulated by newly discovered TFBSs
[<xref ref-type="bibr" rid="B2">2</xref>]. Also, mutations in TFBS underlie several degenerative human diseases (e.g., all forms of cancer) and constitute a substantial component of the phenotypic variability within and across species
[<xref ref-type="bibr" rid="B5">5</xref>
].</p>
<p>Structural and functional information on mechanisms of interaction between TFs and their binding sites are provided by experimental techniques, which are costly and time-consuming.</p>
<sec><title>TFBS discovery as an algorithmic problem</title>
<p>The identification of (possible) functional sites can be formulated as an algorithmic problem, provided a mathematical abstraction is given to model TFBSs. Two of the most popular such models are <italic>Position Specific Score Matrices</italic>
 (<italic>PSSM</italic>
) and <italic>Hamming distance</italic>
 (<italic>HD</italic>) models (see, e.g.,
[<xref ref-type="bibr" rid="B6">6</xref>
,<xref ref-type="bibr" rid="B7">7</xref>
]). Here we will adopt the HD model, which we now briefly recall.</p>
<p>In the HD model, a <italic>simple motif </italic>
<italic>M</italic>
<sub><italic>w</italic>
</sub>
is given by a word <italic>w</italic>
 (over the DNA alphabet), sometimes called the <italic>consensus</italic>
, together with an integer <italic>e</italic>
, 0 ≤<italic> e </italic>
< |<italic>w</italic>
|. The <italic>occurrences</italic>
 of <italic>M</italic>
<sub><italic>w </italic>
</sub>
are those words <italic>v</italic>
 whose Hamming distance from the consensus is bounded by <italic>e</italic>
, i.e., <italic>d</italic>
<sub><italic>H</italic>
</sub>
(<italic>w</italic>
,<italic>v</italic>
) ≤<italic> e</italic>
. Note that <italic>e </italic>> 0 accounts for possible mutations (here only nucleotide substitutions) in functional sites relative to the same TF. Figure
<xref ref-type="fig" rid="F1">1a</xref>
 shows examples of simple motifs and simple motif occurrences.</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p><bold>Simple and structured motifs.</bold>
 Part (<bold>a</bold>
): some DNA sequences with instances of the two simple motifs <monospace>GATAAG</monospace>
 (one base substitution tolerated) and <monospace>TATAAAA</monospace>
 (up two base substitutions tolerated), highlighted in blue and red, respectively. Part (<bold>b</bold>
): three instances of a structured motif (6,1)-[2,4]-(6,1)-[3,5] -(7,2).</p>
</caption>
<graphic xlink:href="1748-7188-7-20-1"></graphic>
</fig>
<p>Computational <italic>motif discovery</italic>
 (or <italic>motif finding</italic>
) can be defined as the task of inferring the mathematical abstractions subject to the identification of the occurrences (i.e., the potential binding sites) in the input sequences. The typical input to a motif discovery program under the HD model includes a pair (<italic>ℓ</italic>
,<italic>e</italic>
), which describes the length of the consensus and the maximum substitutions allowed in the occurrences, respectively.</p>
<p>Motif discovery is a very difficult problem
[<xref ref-type="bibr" rid="B8">8</xref>
], since the space of possible occurrences may be huge. The inverse problem, i.e., finding the occurrences given a motif definition, is called <italic>motif search</italic>
 and is instead comparatively much easier than discovery.</p>
<p>Simple motifs are typically used to represent TFBSs in lower eukaryotes. When more than one binding site is involved in gene regulation, as in higher eukaryotes, their collective formal description is more elaborate. Here, we are interested in formal models of so-called <italic>structured motifs</italic>
, which can be simply defined as sets of simple motifs, often called <italic>boxes</italic>
, whose occurrences, in the input DNA fragments, must satisfy given order and distance constraints. The input to a structured motif finder can be succinctly described using <italic>template strings</italic>
: </p>
<p><disp-formula><mml:math id="M1" name="1748-7188-7-20-i1" overflow="scroll"><mml:mrow><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mo>(</mml:mo>
<mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>[</mml:mo>
<mml:msub><mml:mrow><mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>]</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>…</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mspace width="3em"></mml:mspace>
<mml:mspace width="2em"></mml:mspace>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>[</mml:mo>
<mml:msub><mml:mrow><mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>]</mml:mo>
<mml:mo>−</mml:mo>
<mml:mo>(</mml:mo>
<mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> where, for all admissible <italic>j</italic>
 and <italic>k</italic>
, <italic>ℓ</italic>
<sub><italic>j </italic>
</sub>
and <italic>e</italic>
<sub><italic>j </italic>
</sub>
constrain the motifs that can occur as box <italic>j</italic>
 according to the HD model, while <italic>d</italic>
<sub><italic>k</italic>
</sub>
 and <italic>D</italic>
<sub><italic>k </italic>
</sub>
are lower and upper bounds on the number of nucleotides between box <italic>k</italic>
 and box <italic>k</italic> + 1. Figure
<xref ref-type="fig" rid="F1">1b</xref>
 illustrates the concept in case of <italic>b </italic>
= 3.</p>
</sec>
<sec><title>Focus of the paper</title>
<p>There is an already huge literature on motif discovery (see, e.g.,
[<xref ref-type="bibr" rid="B6">6</xref>
,<xref ref-type="bibr" rid="B8">8</xref>
-<xref ref-type="bibr" rid="B19">19</xref>] and also the references contained in the survey papers
[<xref ref-type="bibr" rid="B20">20</xref>
,<xref ref-type="bibr" rid="B21">21</xref>]). However, for our purposes the proposed algorithmic solutions fall into two classes: (1) optimization algorithms (either deterministic or probabilistic), and (2) enumerative, exact algorithms. Algorithms of the first class seek motifs that optimize a certain scoring function, usually exploring only a limited portion of the space of all possible motif candidates (see
[<xref ref-type="bibr" rid="B6">6</xref>
,<xref ref-type="bibr" rid="B9">9</xref>
,<xref ref-type="bibr" rid="B10">10</xref>
,<xref ref-type="bibr" rid="B22">22</xref>
] for influential works). On the other hand, enumerative algorithms exhaustively search the motif space<sup>a</sup>
[<xref ref-type="bibr" rid="B12">12</xref>
,<xref ref-type="bibr" rid="B14">14</xref>
,<xref ref-type="bibr" rid="B16">16</xref>
,<xref ref-type="bibr" rid="B23">23</xref>
,<xref ref-type="bibr" rid="B24">24</xref>
].</p>
<p>A fundamental component of exact methods is what we can term the <italic>enumeration engine</italic>, i.e., the algorithm adopted to generate all the possible candidate motifs to be later evaluated on some statistical basis (for another example, see
[<xref ref-type="bibr" rid="B25">25</xref>]). Actually, some exact motif finding tools have been proposed which are just enumeration engines, simply returning all the motifs that satisfy the input constraints
[<xref ref-type="bibr" rid="B7">7</xref>
,<xref ref-type="bibr" rid="B26">26</xref>
-<xref ref-type="bibr" rid="B28">28</xref>
]. Clearly, an appraisal of any such engine depends on its computational efficiency only.</p>
<p>The availability of enumeration tools is useful both because they can be taken as building blocks for more sophisticated finders and because they inspired (and still inspire) research in the whole field of exact methods. In this respect, it is worth observing that the tool which best behaved in the now famous assessment by Tompa et el.
[<xref ref-type="bibr" rid="B29">29</xref>], namely Weeder, borrows its enumeration engines from
[<xref ref-type="bibr" rid="B7">7</xref>
].</p>
<p>In this paper we concentrate our attention on enumerative algorithms for structured motif discovery in a set of input DNA fragments. In particular, we focus on enumeration engines and base our analysis on running time and (to a lesser extent) memory consumption. We pay no attention to the “quality” of the results, simply because the output only depends on the input constraints posed to the motifs being sought. Running time is instead especially critical since faster enumeration leaves more room for post processing (i.e., picking the motifs deemed more likely to represent functional sites).</p>
<p>Before describing the contribution of our work, we analyze in more details the results presented in the literature that deal with enumeration engines for structured motif discovery.</p>
</sec>
<sec><title>Related work</title>
<p>Existing algorithms are essentially based on one of two possible approaches: (1) directly explore the search space of structured motifs, or (2) first extract the simple motifs that may occur as boxes (using any available simple motif finder) and then “assemble” them into structured motifs that satisfy box order and distance constraints. We shall refer to the latter as to the <italic>2-stage approach</italic>
.</p>
<p>A well-known potential advantage of directly exploring the space of structured motifs is that the combined boxes, together with the distance constraints, may be strong enough to quickly emerge, possibly together with few others spurious structures, even though each single box is a weak signal (see, e.g.,
[<xref ref-type="bibr" rid="B15">15</xref>]). We point out that the most efficient direct approach algorithms makes use of the (generalized) suffix tree data structure
[<xref ref-type="bibr" rid="B30">30</xref>
,<xref ref-type="bibr" rid="B31">31</xref>
].</p>
<p>The 2-stage approach was first mentioned by Marsan and Sagot
[<xref ref-type="bibr" rid="B32">32</xref>], who nonetheless deemed it impractical due to the high resource consumption. Recently, however, it was re-considered by Zhou et al.
[<xref ref-type="bibr" rid="B28">28</xref>
], who provided much tighter theoretical upper bounds on the runtime and space complexity. They designed the Ecomp algorithm and showed it to be more efficient than more sophisticated exact methods in their experimental settings.</p>
<p>Some available exact motif finders require that at least one instance of the motif be exact, <italic>i.e.</italic>
, that it actually appears in one of the input sequences. This leads to a reduction of the motif search space with ensuing time and space savings. This simplified version of the problem is called <italic>Frequent (Structured) Motif Discovery</italic> problem
[<xref ref-type="bibr" rid="B26">26</xref>
].</p>
<sec><title>SMILE, RISO and RISOTTO</title>
<p>SMILE
[<xref ref-type="bibr" rid="B32">32</xref>] is a family of algorithms designed to solve slightly different variants of the structured motif discovery problem on set of input sequences. SMILE extends to structured motifs the algorithmic ideas on simple motif enumeration presented in
[<xref ref-type="bibr" rid="B7">7</xref>
]<sup>b</sup>. To explore the space of possible structured motifs, SMILE uses a generalized suffix tree of the input sequences together with a (virtual) lexicographic tree of all possible simple motifs. Improvements to SMILE are presented by Carvalho et al.
[<xref ref-type="bibr" rid="B33">33</xref>
]. Their <sc>Riso</sc>
 algorithm exhibits an exponential time and space gain over SMILE in the worst case. <sc>Riso</sc> works on a variation of the generalized suffix tree (called generalized factor tree)
[<xref ref-type="bibr" rid="B34">34</xref>
], built only up to the box length level, with some extra information used for fast update of the tree. <sc>Riso</sc>
’s computational complexity is exponential with respect to the number of boxes and their lengths, but it does not depend on inter-box distances, thanks to the use of box links.</p>
<p>A further improvement is achieved by <sc>Risotto</sc>
[<xref ref-type="bibr" rid="B27">27</xref>
], although only on the average, thanks to it’s ability to quickly detect dead ends (i.e., words that cannot possibly be extended to a valid motif). In practice, <sc>Risotto</sc>
 is more than twice faster than <sc>Riso</sc>
 and, to the best of our knowledge, also the most efficient algorithm for exact enumeration of structured motifs composed of any number of boxes. For this reason, we will take <sc>Risotto</sc>
 as our primary competitor in the experimental tests.</p>
</sec>
<sec><title>ECOMP</title>
<p>Ecomp is a general, 2-stage algorithm that uses Mitra-count
[<xref ref-type="bibr" rid="B15">15</xref>
] to find all simple motifs and the starting positions of all their occurrences<sup>c</sup>
. In the second step, the algorithm looks for dyads by checking all pairs of occurrences of simple motifs, keeping and counting only those satisfying the distance constraint. At the end, Ecomp outputs only the dyads satisfying the quorum constraint.</p>
<p>Unfortunately, Ecomp is fully described and tested only for dyads and source code is apparently not available.</p>
</sec>
<sec><title>ExMotif</title>
<p>The 2-stage approach is also used by <sc>Ex</sc>
<sc>Motif</sc>
[<xref ref-type="bibr" rid="B26">26</xref>
] to solve the frequent structured motif extraction problem. <sc>Ex</sc>
<sc>Motif</sc>
’s main data structures are lists that store the positions of patterns appearing in the sequences. These lists are repeatedly intersected in order to find motifs that satisfy the input constraints. The output produced by <sc>Ex</sc>
<sc>Motif</sc>
 includes the structured motifs satisfying the quorum constraint and only the positions of their exact occurrences. The computational cost of <sc>Ex</sc>
<sc>Motif</sc>, as reported in
[<xref ref-type="bibr" rid="B26">26</xref>
], is exponential with respect to the number and the length of boxes.</p>
</sec>
<sec><title>Results</title>
<p>Our contribution is twofold. We present a novel 2-stage algorithm, called SISMA, that enumerates all the structured motifs conforming to input specifications. More precisely, we describe two different software tools that implement SISMA’s ideas. The first version, called SISMA_<sc>Smile</sc>
, solve the “unconstrained” enumeration problem, while the second one, named SISMA_<sc>Speller</sc>
, addresses the frequent structured motif discovery problem. We compare the performances of SISMA_<sc>Smile</sc>
 (resp., SISMA_<sc>Speller</sc>
) against those of <sc>Risotto</sc>
 (resp. <sc>Ex</sc>
<sc>Motif</sc>
) on a comprehensive dataset composed of both synthetic and real biological data. The experimental results show that our tools are competitive in enumerating spaces of structured motif candidates.</p>
<p>We also try to go one step further and reflect on the relative merits of direct vs 2-stage approaches for structured motifs finding. The latter enjoy some potential design advantages, such as modularity and ease of parallelization (see the concluding section for more on these aspects). However, the argument of computational inefficiency has often been used to discourage their active use. From our experiments here we can not devise strong arguments in favor of any of the two approaches. It is true that, in some circumstances, direct methods can explore spaces which are beyond the capabilities of a 2-stage algorithm. However, in other cases our 2-stage approach software results much faster than the competitor direct tool. In the concluding section we will give some guidelines (depending on parameter sets) to possibly assist the users to choose the most suitable tool for the problem instances at hand.</p>
</sec>
</sec>
</sec>
<sec sec-type="methods"><title>Methods</title>
<p>In this section we describe the implementation of <italic>SISMA</italic>
 (“Successive Intersection of Simple Motifs Apart”), a structured motif finder based on the 2-stage approach.</p>
<p>SISMA is an exact algorithm which takes in <bold>input</bold>
 the following set of parameters: </p>
<p>1. the set of sequences, in Fasta format, where the motifs must be found;</p>
<p>2. the number <italic>b </italic>
≥ 2 of boxes (simple motifs) which the structured motifs will be made of;</p>
<p>3. an ordered set of <italic>b</italic>
 pairs of integers: (<italic>ℓ</italic>
<sub><italic>i</italic>
</sub>
,<italic>e</italic>
<sub><italic>i</italic>
</sub>
), <italic>i </italic>
= 1,…,<italic>b</italic>
, such that <italic>ℓ</italic>
<sub><italic>i </italic>
</sub>
is the length of the <italic>i</italic>
<sup>th</sup>
 box and <italic>e</italic>
<sub><italic>i </italic>
</sub>
the corresponding number of admissible errors;</p>
<p>4. for each pair of consecutive boxes, say the <italic>i</italic>
<sup>th</sup>
 and <italic>i</italic>
 + 1<sup>th</sup>
 ones, a pair of integers (<italic>d</italic>
<sub><italic>i</italic>
</sub>
,<italic>D</italic>
<sub><italic>i</italic>
</sub>
) that specify the minimum and maximum number of bases, respectively, that may separate the two boxes, <italic>i </italic>
= 1,…,<italic>b </italic>
− 1.</p>
<p>5. a value <italic>q </italic>
∈ (0,1] (the so-called <italic>quorum</italic>
) that specifies the minimum fraction of input sequences that must contain an instance for the structured motif to be considered valid.</p>
<p>The <bold>output</bold>
 of the algorithm is made of all the possible structured motifs that conform to input specifications.</p>
<p>SISMA is implemented in C++ and its source code is available for download from
<ext-link ext-link-type="uri" xlink:href="http://algo.ing.unimo.it/mf/">http://algo.ing.unimo.it/mf/</ext-link>
.</p>
<sec><title>Basic implementation</title>
<p>SISMA stores simple and structured motifs using vector data structures that make it possible to perform list intersections and filtering operations (i.e., the distance and quorum checks described below) in a very efficient way (technical details can be found in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
).</p>
<p>In current implementations, SISMA comes in two versions, which will be referred to as SISMA_<sc>Smile</sc>
 and SISMA_<sc>Speller</sc>
, solving the structured and the frequent motif discovery problems, respectively.</p>
<sec><title>Stage 1</title>
<p>SISMA_<sc>Smile</sc> first calls SMILE
[<xref ref-type="bibr" rid="B32">32</xref>
] for simple motif discovery. To the best of our knowledge, SMILE is the only tool available for download which is exact and that returns the positions of all the occurrences of found motifs<sup>d</sup>
.</p>
<p>SISMA_<sc>Speller</sc>
 uses our implementation of the <sc>Speller</sc>
[<xref ref-type="bibr" rid="B7">7</xref>
] algorithm, which returns simple motifs with at least one exact match in the input sequences, together with all their occurrences. We decided for a new <sc>Speller</sc>
 implementation because, apparently, there is no available tool with these characteristics<sup>e</sup>
.</p>
<p>Independently of the simple motif finder adopted, the output of the first stage is a set of simple motifs with associated position lists of their occurrences, each one ordered by increasing sequence indexes and increasing positions within the sequence. Logically, motifs returned by stage 1 are classified into <italic>b</italic>
 subsets, denoted by <italic>K</italic>
<sub><italic>i</italic>
</sub>
,<italic>i </italic>
= 1,…,<italic>b</italic>
, such that <italic>m </italic>
∈<italic> K</italic>
<sub><italic>i</italic>
</sub>
 if and only if <italic>m</italic>
 can be the <italic>i</italic>
<sup>th</sup>
 box of one of the structured motifs being sought. Each set <italic>K</italic>
<sub><italic>i</italic>
</sub>
is maintained as a vector data structure, each cell of which in turns stores a pointer to a vector containing all occurrences of exactly one <italic>m </italic>
∈<italic> K</italic>
<sub><italic>i</italic>
</sub>
, ordered by increasing input sequence index and increasing position in each sequence.</p>
<p>Observe that there is no filtering process of simple motifs found in this stage, because there is apparently no relations between significance of simple motifs and significance (or mere existence) of structured motif, as clearly stated in
[<xref ref-type="bibr" rid="B35">35</xref>
]. In particular, structured motifs might exist (and reach quorum) only because they contain weak simple motifs.</p>
</sec>
<sec><title>Stage 2</title>
<p>For <italic>i </italic>
= 0,…,<italic>b</italic>
, we will use the term <italic>i</italic>
<italic>-prefix</italic>
 to denote any structured motif made of <italic>i</italic>
 boxes that could possibly be extended to a full structured motif conforming to the problem specification.</p>
<p>In the second stage, which is divided in <italic>b</italic>
 steps, SISMA builds prefixes of increasing length, starting from the empty prefix in step 1 and ending with <italic>b</italic>
-prefixes (<italic>i.e.</italic>
, full structured motifs) in step <italic>b</italic>
. Basically, at generic step <italic>i</italic>
, SISMA considers all (<italic>i </italic>
− 1)-prefixes <italic>p</italic>
 and all motifs <italic>m </italic>
∈<italic> K</italic>
<sub><italic>i</italic>
</sub>
 to assemble possible <italic>i</italic>
-prefixes <italic>r </italic>
=<italic> p </italic>
− (<italic>d</italic>
<sub><italic>i</italic>
</sub>
,<italic>D</italic>
<sub><italic>i</italic>
</sub>
) −<italic> m</italic>
. The computed <italic>i</italic>
-prefixes are stored in a vector data structure, analogously to what is done with simple motifs.</p>
<p>During a prefix assembly step, SISMA checks distance and quorum constraints, in order to discard, as early as possible, prefixes that could not possibly be extended to full structured motifs.</p>
<p><italic>Distance check:</italic>
 for each potential <italic>i</italic>
-prefix <italic>r</italic>
 being built, <italic>r </italic>
=<italic> p </italic>
− (<italic>d</italic>
<sub><italic>i</italic>
</sub>
,<italic>D</italic>
<sub><italic>i</italic>
</sub>
) −<italic> m</italic>
, and any sequence <italic>s</italic>
, SISMA performs a binary search on the sorted occurrences of <italic>m</italic>
 in <italic>s</italic>
 in order to find the first occurrence that satisfies the minimum distance constraint. Then, all the subsequent occurrences of <italic>m</italic>
 are considered, until one is found that violates the maximum distance constraint. In this way, SISMA builds all occurrences of <italic>r</italic>
.</p>
<p><italic>Quorum check:</italic>
 the <italic>i</italic>
-prefix <italic>r</italic>
 is discarded if its occurrences appear in less than <italic>q</italic>
|<italic>S</italic>
| input sequences. In fact, it is obvious that prefix extension can only reduce the eventual motif quorum.</p>
</sec>
</sec>
<sec><title>Options</title>
<p>The basic implementation has been enhanced with some options that might be used to have even more efficient second stage runs, under some circumstances. Implementation details can be found in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
.</p>
<sec><title>Box index selection option</title>
<p>When this option is selected, SISMA builds structured motifs by considering boxes of increasing total number of occurrences and not by increasing index order. In more pictorial terms, the full (final) structured motifs are not determined by assembling longer and longer prefixes but rather longer and longer structured motif “subsequences”.</p>
<p>When using this option, before starting the <italic>b</italic>
 steps of stage 2, SISMA computes the total number of simple motif occurrences in each set <italic>K</italic>
<sub><italic>i</italic>
</sub>
, <italic>i </italic>
= 1,…,<italic>b</italic>
: </p>
<p><disp-formula id="bmcM1"><label>(1)</label>
<mml:math id="M2" name="1748-7188-7-20-i2" overflow="scroll"><mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munder accentunder="true"><mml:mrow><mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>t</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub><mml:mrow><mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msub><mml:mrow><mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mo>|</mml:mo>
<mml:mtext mathvariant="italic">oc</mml:mtext>
<mml:msub><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>,</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>where <italic>oc</italic>
<italic>c</italic>
<sub><italic>i</italic>
,<italic>t</italic>
</sub>
 is the set of occurrences of motif <italic>m</italic>
<sub><italic>t </italic>
</sub>
∈<italic> K</italic>
<sub><italic>i</italic>
</sub>
, <italic>t </italic>
= 1,…,|<italic>K</italic>
<sub><italic>i</italic>
</sub>
|. SISMA then sorts the sets {<italic>B</italic>
<sub><italic>i</italic>
</sub>
}<sub><italic>i </italic>
= 1,…,<italic>b</italic>
</sub>
 and forms a list <italic>L</italic>
 with the corresponding box indexes, i.e., if <italic>L</italic>
<sub><italic>i </italic>
</sub>
=<italic> j </italic>
and <italic>L</italic>
<sub><italic>i</italic>
 + 1</sub>
 =<italic> k</italic>
, then <italic>B</italic>
<sub><italic>j </italic>
</sub>
≤<italic> B</italic>
<sub><italic>k</italic>
</sub>
, <italic>i </italic>
= 1,…,<italic>b </italic>
− 1. Then, it uses the list <italic>L</italic>
 in stage 2 to determine the order with which to add the boxes to the structured motifs being built.</p>
<p>This option allows to limit the number of <italic>useless</italic>
 intermediate structured motifs (<italic>i</italic>
-prefixes) that are generated (<italic>i.e.</italic>
, those that eventually would be discarded, because they either could not be extended to full <italic>b</italic>
-boxes or would not satisfy the quorum constraint). This is particularly effective when the expected number of structured motifs in the output is not large, which is likely to happen, <italic>e.g.</italic>
, when the box length is large and a small number of errors are admitted, or simply when the number of boxes is large. There is a trade off between the slowdown introduced to store and handle extra information needed to implement this option and the speedup obtained by reducing the number of useless intermediate structured motifs. If the output is large, the slowdown is predominant; in contrast, if the output is small, speedup is predominant.</p>
<p>In practice, this option helped SISMA to drastically reduce the out-of-memory failures, especially on synthetic data.</p>
</sec>
<sec><title>Space-saving option</title>
<p>Sometimes the output to be generated is very large. This happens, <italic>e.g.</italic>
, when the first stage has returned a huge number of simple motifs occurrences. In turns, this can be a consequence of the particular values of the input parameters, such as very short motif lengths and/or relatively small difference between length and number of available errors. Under these circumstances, SISMA basic implementation, which keeps all the intermediate motifs (including the simple ones) in main memory, may fail due to memory shortage.</p>
<p>To cope with this situation, especially on low memory PCs, SISMA can be run with a specific option that produces the output in distinct slices, and that requires less main memory to produce the output of each slice. Given an integer <italic>v</italic>
 as the value of the space-saving option parameter, for <italic>i </italic>
= 1,…,<italic>b</italic>
, each set of motifs <italic>K</italic>
<sub><italic>i</italic>
</sub>
 is partitioned into ⌈|<italic>K</italic>
<sub><italic>i</italic>
</sub>
|/<italic>v</italic>⌉ subsets
<inline-formula><mml:math id="M3" name="1748-7188-7-20-i3" overflow="scroll"><mml:msub><mml:mrow><mml:mi mathvariant="script">K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:msub><mml:mrow><mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo>⌈</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub><mml:mrow><mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>⌉</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo>}</mml:mo>
</mml:math>
</inline-formula>
, each one containing at most <italic>v</italic> distinct motifs. The second stage is then run once for each possible element in the Cartesian product
<inline-formula><mml:math id="M4" name="1748-7188-7-20-i4" overflow="scroll"><mml:msub><mml:mrow><mml:mi mathvariant="script">K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>×</mml:mo>
<mml:msub><mml:mrow><mml:mi mathvariant="script">K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>×</mml:mo>
<mml:mo>⋯</mml:mo>
<mml:mo>×</mml:mo>
<mml:msub><mml:mrow><mml:mi mathvariant="script">K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
. In this way SISMA drastically reduces the number of intermediate structured motifs generated and usually avoids out-of-memory failures at the price of a moderate slowdown (see the Computational cost section).</p>
<p>The only cases that this variant of SISMA is not able to handle are those in which the memory shortage is due to stage 1, i.e., when the number of simple motifs (and their occurrences) is simply too large to fit in memory (we will see that this happens in few very difficult instances on synthetic data).</p>
</sec>
<sec><title>Occurrence print option</title>
<p>SISMA might be instructed to output the starting positions of all occurrences of the discovered structured motifs.</p>
<p>Not using this option allows a fair comparison with tools that do not print all motif occurrences, but just the motif definitions.</p>
</sec>
<sec><title>Computational cost</title>
<p>The time computational cost of SISMA is given by the cost of simple motif extraction plus that of occurrence list intersections. Here we will first refer to the basic version of SISMA, with only briefly mentioning the various options at the end of the section.</p>
<p>With the current implementation, the simple motif extraction tool must be run once for each different single box template (i.e., for all different (<italic>ℓ</italic>
,<italic>e</italic>
) pairs). Both SMILE and our modified version of <sc>Speller</sc>
 have worst-case time complexity in <italic>O</italic>
(<italic>N</italic>
<italic>t</italic>
<sub><italic>ℓ</italic>
</sub>
<italic>ν</italic>
(<italic>e</italic>
,<italic>ℓ</italic>
)), where <italic>t</italic>
<sub><italic>ℓ</italic>
</sub>
 is the number of suffix tree nodes at depth <italic>ℓ</italic>
, <italic>N</italic>
 is the number of input sequences, and <italic>ν</italic>
(<italic>e</italic>
,<italic>ℓ</italic>
) is the number of words of length <italic>ℓ</italic>
 that differ in at most <italic>e</italic>
 letters from a word <italic>m</italic>
 of length <italic>ℓ</italic>
. It holds that <italic>ν</italic>
(<italic>e</italic>
,<italic>ℓ</italic>
) ≤<italic> ℓ</italic>
<sup><italic>e</italic>
</sup>
|Σ|<sup><italic>e</italic>
</sup>
. Hence, the time complexity is linear in the input size, but possibly exponential in the number <italic>e</italic> of substitutions. Thus, as we are working with the DNA alphabet, the first stage takes
<inline-formula><mml:math id="M5" name="1748-7188-7-20-i5" overflow="scroll"><mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>·</mml:mo>
<mml:munderover><mml:mrow><mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:msubsup><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:msup><mml:mrow><mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
.</p>
<p>Let <italic>B</italic>
<sub><italic>i</italic>
</sub>
be the total number of occurrences of simple motifs found for the <italic>i</italic>
<sup><italic>th</italic>
</sup>
 box in the first phase (see Equation 1), for <italic>i </italic>
= 1,…,<italic>b</italic>
, and let <italic>S</italic>
<sub><italic>j</italic>
</sub>
 be the total number of <italic>j</italic>
-prefixes occurrences found during the <italic>j</italic>
<sup><italic>th</italic>
</sup>
 step of the second stage, for <italic>j </italic>
= 2,…,<italic>b</italic>
−1. The cost of the occurrence list intersection phase is upper bounded by: </p>
<p><disp-formula><mml:math id="M6" name="1748-7188-7-20-i6" overflow="scroll"><mml:mrow><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub><mml:mrow><mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo>⋯</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mspace width="2em"></mml:mspace>
<mml:mspace width="2em"></mml:mspace>
<mml:mspace width="2em"></mml:mspace>
<mml:mspace width="3em"></mml:mspace>
<mml:mo>⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub><mml:mrow><mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover accentunder="true" accent="true"><mml:mrow><mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p> where equality holds since <italic>S</italic>
<sub>1</sub>
 =<italic> B</italic>
<sub>1</sub>
. Hence, the computational cost of SISMA is </p>
<p><disp-formula id="bmcM2"><label>(2)</label>
<mml:math id="M7" name="1748-7188-7-20-i7" overflow="scroll"><mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")"><mml:mrow><mml:mi>N</mml:mi>
<mml:munderover accentunder="true" accent="true"><mml:mrow><mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:msubsup><mml:mrow><mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:munderover accentunder="true" accent="true"><mml:mrow><mml:mi mathsize="big">∑</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>b</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub><mml:mrow><mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mi>.</mml:mi>
</mml:math>
</disp-formula>
</p>
<p>Equation (2) clearly shows that the running time of the second stage depends essentially on the number of occurrences of simple motifs and that of intermediate structured motifs. Note, however, that if there is a large number of simple motifs the cost of first stage is high as well. Low cost of the first stage and high cost of second stage are possible only if there are relatively few simple motifs but many intermediate structured motifs. This is in principle possible, but in practice it hardly happens due to order and distance constraints, as the computational experiments clearly indicate. In practice, thus, the cost of extracting structured motifs is comparable with that of simple motif finding, at least when the starting positions of all the occurrences are required. As for the memory space used, it is not difficult to see that this is the maximum between: (1) SMILE space complexity needed to generate all the motifs occurrences for each box, and (2) the space needed to generate <italic>j</italic>
-prefixes occurrences, <italic>i.e.,</italic>
<italic>O</italic>
(max<sub><italic>j</italic>
∈[2<italic>.b</italic>
−1]</sub>
{<italic>S</italic>
<sub><italic>j</italic>
</sub>
 + <italic>B</italic>
<sub><italic>j</italic>
 + 1</sub>
}).</p>
<p>We now briefly consider options. As for index selection, it can be easily seen that the handling of more complex data structures in main memory introduces, in the worst case, time and memory penalties linear with the number <italic>b</italic>
 of boxes. In practice, however, non-worst case instances might run much faster with this option activated (see Options section). For what concerns the space saving option, it can be proved that the slowdown is constant, although the exact figures depend on low level implementation issues. On the machine used to perform the experiments, the running times with space saving activated were almost four times higher. We must point out, however, that the intended use of this option is just to avoid out-of-memory failures, and these can be regarded as infinite time computations. Then, in these cases the option can be thought to provide (sometimes) “unbounded” speedups. In this case, space requirement is dominated only by SMILE space requirement. Finally, SISMA actually generates all occurrences during computation, and hence printing them takes only linear time in the number of occurrences.</p>
</sec>
</sec>
</sec>
<sec sec-type="results"><title>Results</title>
<p>We have performed a series of computational experiments on both synthetic and real biological data with a twofold goal: (1) compare the direct and 2-stage approaches using the best available algorithm (<sc>Risotto</sc>
) for the former and our SISMA_<sc>Smile</sc>
 code for the latter; (2) compare SISMA_<sc>Speller</sc>
 with <sc>Ex</sc>
<sc>Motif</sc>
[<xref ref-type="bibr" rid="B26">26</xref>
], the only exact tool for the frequent structured motif extraction problem which adopts the 2-stage approach and whose code is available (see the Related work section)<sup>f</sup>
.</p>
<p>We performed all the experiments on an uniprocessor AMD Athlon 64 3200+ with 1GB of RAM, forcing a timeout of twelve hours for the execution of each tool.</p>
<sec><title>Tests on synthetic data</title>
<p>In this section we report the results of tests performed on synthetic data, which are often used to validate the effectiveness of existing methods in a fully controlled experimental setting, and to experimentally evaluate their scalability properties. In particular, we generated synthetic data sets according to the so called <italic>Planted Motif Problem</italic>
 (<italic>PMP</italic>)
[<xref ref-type="bibr" rid="B36">36</xref>
] in the following way:</p>
<p><bold>Sequence generation:</bold>
 we randomly generated the sequences of the input set <italic>S</italic>, assuming the characters of each sequence be i.i.d. and with equal probability (0.25) assigned to each symbol. According to
[<xref ref-type="bibr" rid="B36">36</xref>
], the data sets contain 20 sequences of 600 characters each.</p>
<p><bold>Structured motif planting:</bold>
 we selected the number <italic>b</italic>
 of boxes and the <italic>b</italic>
 pairs (<italic>ℓ</italic>
<sub><italic>i</italic>
</sub>
,<italic>e</italic>
<sub><italic>i</italic>
</sub>
) using different rules, which we will specify when describing the experiments. We generated distance constraints at random, making sure that the total maximum distance between the first and last box fit into the sequences. For each pair (<italic>ℓ</italic>
<sub><italic>i</italic>
</sub>
,<italic>e</italic>
<sub><italic>i</italic>
</sub>
), defining the template for a simple motif <italic>m</italic>
<sub><italic>i</italic>
</sub>
, <italic>i </italic>
= 1,…,<italic>b</italic>
, we first selected a random word <italic>w</italic>
<sub><italic>i</italic>
</sub>
(the “exact” instance of <italic>m</italic>
<sub><italic>i</italic>
</sub>
) and then generated |<italic>S</italic>
| occurrences of <italic>m</italic>
<sub><italic>i</italic>
</sub>
, at Hamming distance ≤<italic> e</italic>
<sub><italic>i </italic>
</sub>
from <italic>w</italic>
<sub><italic>i</italic>
</sub>
, by substituting <italic>e</italic>
<sub><italic>i </italic>
</sub>
characters of <italic>w</italic>
<sub><italic>i </italic>
</sub>
with characters from Σ chosen uniformly at random. Finally, we planted the occurrences, one per sequence, by respecting box order and distance constraints (but otherwise at random). When generating a dataset to be tested using <sc>Ex</sc>
<sc>Motif</sc>
, we planted at least one exact structured motif occurrence.</p>
<p>The parameters used to built the dataset where then used to run the motif finding tools on that dataset. The quorum is set to <italic>q </italic>
= 1.0 in all tests.</p>
<p>The data set generating process outlined above produces boxes that are instances of the PMP. There is a wide literature on the PMP, especially for single motif extraction (see, e.g.
[<xref ref-type="bibr" rid="B12">12</xref>
,<xref ref-type="bibr" rid="B37">37</xref>
,<xref ref-type="bibr" rid="B38">38</xref>]). Preliminary results relative to the structured motif extraction settings can be found in
[<xref ref-type="bibr" rid="B39">39</xref>] and, limited to dyads, in
[<xref ref-type="bibr" rid="B15">15</xref>
,<xref ref-type="bibr" rid="B28">28</xref>]. Using a simple model
[<xref ref-type="bibr" rid="B37">37</xref>
], we can estimate the number <italic>E</italic>
(<italic>ℓ</italic>
<italic>e</italic>
) of simple motifs that one expects to find in a randomly generated sequence, depending on the length of the motif and on the number of allowed errors. Expectation of pairs makes some instances easier to solve than others. When talking about “difficult” instances we will refer to ones in which the expected number of randomly found motifs is high.</p>
<p><bold>Experimental settings:</bold>
 Here we present results concerning a set of tests in which we planted boxes with variable lengths and number of allowed substitutions, randomly chosen among those with expectation close to one (<italic>i.e.</italic>
 for which the planted motif and a little number of other random motifs can be expected), over the following pairs: (9, 2), (10, 2), (11,2), (11, 3), (12, 3), (13, 3), (14, 1), (14, 2), (14, 3), (14, 4), (15, 1), (15, 2), (15, 3), (15, 4), (15, 5).</p>
<p>We varied the number <italic>b</italic>
 of boxes between 2 and 10 and ran the algorithms 20 times on different datasets. We ran SISMA with the <italic>box index selection</italic>
 option, which resulted very effective in this set of experiments.</p>
<p>Further results on synthetic data are briefly reported at the end of this section and, in details, in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
.</p>
<p><bold>Results and discussion:</bold>
 To compare pairs of tools, we used two different measures: (1) <italic>win-count</italic>
, i.e., given a common value of <italic>b</italic>
, the number of times one tool outperformed the other; (2) <italic>running times</italic>
: we report best, worst and average running times, as well as standard deviations, for each value of <italic>b</italic>
 used<sup>g</sup>
 (we computed means and standard deviations omitting the best and worst time runs). Moreover, we separately computed the above measures by considering either all runs, or only runs where both tools ended computations.</p>
</sec>
<sec><title>SISMA_<sc>Smile</sc>
 vs <sc>Risotto</sc>
</title>
<p>Figure
<xref ref-type="fig" rid="F2">2a</xref> reports win-counts, while Figure
<xref ref-type="fig" rid="F2">2b</xref>
 reports the number of times each tool fails for particular values of <italic>b</italic>
.</p>
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>Risotto vs SISMA_Smile on synthetic datasets (a) Number of iterations in which SISMA_<sc>Smile</sc>
 outperforms <sc>Risotto</sc>
 or vice versa, and (<bold>b</bold>
) number of tools failures.</bold>
 Notice that when one tool fails, the other might end computation, hence, failures might not sum up to the total number of runs.</p>
</caption>
<graphic xlink:href="1748-7188-7-20-2"></graphic>
</fig>
<p>From Figure
<xref ref-type="fig" rid="F2">2a</xref>
 we can see that no tool is definitely better than the other. <sc>Risotto</sc>
 is usually more competitive for small number of boxes (up to six), but turns significantly less competitive as the number of boxes increases. Moreover, <sc>Risotto</sc>
 failed on few runs even with relatively few (i.e., six or more) boxes, usually when the first boxes are hard instances of the PMP.</p>
<p>On the other hand, SISMA_<sc>Smile</sc>
 was almost as good as <sc>Risotto</sc>
 for small number of boxes while it handled larger problem instances definitively better. In the majority of tests, SISMA_<sc>Smile</sc>
 ended computation before <sc>Risotto</sc>
 and it failed only once due to memory shortage in the first stage (because of too many simple motif occurrences).</p>
<p>Figure
<xref ref-type="fig" rid="F3">3</xref>
 reports running times. Note that, since <sc>Risotto</sc>
 is the tool which failed more frequently<sup>h</sup>
, its charts show a greater difference in running times compared to <sc>SISMA_Smile</sc>
’s. This means that the runs that were somehow difficult for <sc>Risotto</sc>
 were not particularly hard for SISMA_<sc>Smile</sc>
. It also means, on the contrary, that SISMA_<sc>Smile</sc>
’s failures occurred quite early during the computation (essentially as early as the length of a typical successful computation).</p>
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><bold>Risotto’s and SISMA_Smile’s running times on synthetic datasets.</bold>
 Worst and average run times and standard deviations (in seconds) for SISMA_<sc>Smile</sc>
 and <sc>Risotto</sc>
. Average runtime and standard deviations have been computed omitting best and worst runs. (<bold>a</bold>
) (<bold>b</bold>
) (<bold>c</bold>
) Running times of all runs are considered. (<bold>a’</bold>
) (<bold>b’</bold>
) (<bold>c’</bold>
) Only running times of runs in which both tools end computation are considered. Notice that the scale on Y-axes is not the same for all charts.</p>
</caption>
<graphic xlink:href="1748-7188-7-20-3"></graphic>
</fig>
<p>The best case was almost always favorable to <sc>Risotto</sc>
, and there was no difference for what concerns best case when considering all runs or only those without failures. In the best case, <sc>Risotto</sc>
 ended computation in less than 10 seconds (with the exception of ten boxes, where the best run took 24 secs), while <sc>Risotto</sc>
 ended computation in less than 6 seconds for <italic>b </italic>
≤ 5 and in about 30 seconds for <italic>b </italic>
> 5, never exceeding 35 seconds.</p>
<p>Looking at Figure
<xref ref-type="fig" rid="F3">3</xref>
, we observe what follows: </p>
<p>(a) In the worst case, SISMA_<sc>Smile</sc>
 is much faster than <sc>Risotto</sc>
: the longer run for SISMA_<sc>Smile</sc>
 took less than 74<italic>min</italic>
, while for <sc>Risotto</sc>
 some runs took more than 10<italic>h</italic>
, even for relatively small number of boxes and excluding failures. Hence, even when <sc>Risotto</sc>
 defeated SISMA_<sc>Smile</sc>
, the latter was still relatively fast. The opposite was not always true.</p>
<p>(b) Although average running times should be analyzed with care, <sc>Risotto</sc>
 showed average running times much worse than SISMA_<sc>Smile</sc>
’s, even without considering failures. For instance, in case of ten boxes <sc>Risotto</sc>
’s average runtime was around 1<italic>h</italic>
and a half, while SISMA_<sc>Smile</sc>
 took less than 25min.</p>
<p>(c) <sc>Risotto</sc>
 showed a greater variance across all these data.</p>
<p>We further investigated the structure of the instances in which one tool outperformed the other in order to better understand advantages and disadvantages that may be typical of the direct and 2-stage approaches (see an example in Figure
<xref ref-type="fig" rid="F4">4</xref>
). We anticipate that the key factor is the first stage of <sc>SISMA_Smile</sc>
. </p>
<p>• <sc>SISMA_Smile</sc>
 outperforms <sc>Risotto</sc>
 when SISMA_<sc>Smile</sc>
 first stage is fast. This happens mainly for two reasons (that might happen simultaneously): (i) there is a small total number of simple motifs and SMILE running time is low. (ii) Boxes are characterized by the same pair (length, errors), and hence SMILE is run only once for each pair.</p>
<p>• <sc>Risotto</sc>
<italic>outperforms SISMA_</italic>
<sc>Smile</sc>
 when the first stage of simple motif extraction is slow due to boxes producing a large number of simple motifs. This situation might also affect SISMA_<sc>Smile</sc>
 second stage: a large number of intermediate structured motifs means a time consuming occurrence list intersection stage. In some cases the phenomenon is almost completely eliminated using the box index selection option.</p>
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>Examples: SISMA_Smile vs Risotto.</bold>
 Running times of <sc>Risotto</sc>
, SISMA_<sc>Smile</sc>
’s stages 1 and 2, and of all SISMA_<sc>Smile</sc>
’s list intersection step (during stage 2). A 0s time for list intersection means that the corresponding step took time smaller than timer resolution. The box index selection order during stage 2 is shown. In example (<bold>a</bold>
) SISMA_<sc>Smile</sc>
 outperforms <sc>Risotto</sc>
. Observe that SMILE is called once on the (10,2) pair, so that the time reported for the 6<sup><italic>th</italic>
</sup>
 box is 0. In this example <sc>Risotto</sc>
 is 473 times slower than SISMA. In example (<bold>b</bold>
) <sc>Risotto</sc>
 outperforms SISMA_<sc>Smile</sc>
 because the stage 1 performed by SMILE is slow due to the presence of a box for which a large number of simple motifs is found. In this case SISMA_<sc>Smile</sc>
 is 28 times slower than <sc>Risotto</sc>
. In particular, the most time consuming task is the extraction of the (15,4) box (about 91.7<italic>%</italic>
of total execution time), for which 21631 simple motifs are found.</p>
</caption>
<graphic xlink:href="1748-7188-7-20-4"></graphic>
</fig>
<p>Finally, a closer inspection on <sc>Risotto</sc>
’s behavior shows that its running time may be highly affected by the positions of boxes with large search spaces, <italic>e.g.</italic>, box (14,4). We performed a set of experiments in which we searched for planted long structured motifs characterized by the same boxes (number and positions), with just one “floating” (14,4) box, which we moved from first to last position. While the details can be found in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
, we observe here that <sc>Risotto</sc>
’s performance degraded considerably, while SISMA_<sc>Smile</sc>
’s behavior was essentially unaffected by the (14,4) box position.</p>
</sec>
<sec><title>SISMA_<sc>Speller</sc>
 vs <sc>ExMotif</sc>
</title>
<p>We have no results for the comparison of SISMA_<sc>Speller</sc>
 and <sc>Ex</sc>
<sc>Motif</sc>
, because the latter never ended computation within this experimental settings.</p>
<p>We explain this negative behavior observing that the set of pairs (<italic>ℓ</italic>
,<italic>e</italic>
) among which we chose the boxes of planted structured motifs contained several pairs characterized by large values of <italic>ℓ</italic>
 and <italic>e</italic>
, which <sc>Ex</sc>
<sc>Motif</sc>
 is apparently not able to address.</p>
<p>On the other hand, SISMA_<sc>Speller</sc>
 never failed within this experimental setting, exhibiting low running times. The worst case run (even for ten boxes) was never above three minutes while the best case runs lasted around three seconds for two boxes and about 33 seconds in case of ten boxes.</p>
<sec><title>Results for other synthetic experiments</title>
<p>we conclude this section by mentioning the results obtained for two other synthetic data sets (the details can be found in Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
). </p>
<p>• We tested the tools on “presumably” (i.e., à priori) easy instances for SISMA, where the structured motifs sought were composed by boxes of the same type (i.e., same length and number of errors). Indeed SISMA_<sc>Smile</sc>
 always outperformed <sc>Risotto</sc>
 when both tools ended computation, while <sc>Ex</sc>
<sc>Motif</sc>
 outperformed SISMA_<sc>Speller</sc>
 on input instances with very small values of <italic>ℓ</italic>
,<italic>e</italic>
, and <italic>b</italic>
. For larger values, however, <sc>Ex</sc>
<sc>Motif</sc>
 did not end computations, while SISMA_<sc>Speller</sc>
 failed only when boxes were any of the known very hard PMP instances.</p>
<p>• The other data set was composed of presumably very hard instances, according to the PMP classification. We observed a relatively large number of failures, both of SISMA_<sc>Smile</sc>
 and <sc>Risotto</sc>
. The reasons were essentially those already observed, but interestingly enough, the two tools did not (usually) fail on the same instances, meaning that a difficult instance for one tool might not be so difficult for the other. <sc>Ex</sc>
<sc>Motif</sc>
 did end computation only on a limited number of instances and only in very few cases it outperformed SISMA_<sc>Speller</sc>
.</p>
</sec>
</sec>
<sec><title>Tests on real biological data</title>
<p>In this section we report the results of experiments performed on three different datasets composed of upstream regions of co-regulated genes of the <italic>Saccharomyces cerevisiae</italic>
 in order to extract motifs representing transcription factor binding sites.</p>
<sec><title>UASH-URS1-10 dataset</title>
<p>The dataset was drawn from
[<xref ref-type="bibr" rid="B13">13</xref>
]. It contains the upstream sequences of 11 meiotic genes of the <italic>Saccharomyces cerevisiae</italic>
 which are cooperatively regulated by the transcription factors URS1H and UASH involved in the meiotic expression during sporulation.</p>
<p>These 11 genes are listed in SCPD
[<xref ref-type="bibr" rid="B40">40</xref>
]. In 10 out of the 11 genes, the URS1H binding site appears downstream from UASH site and both sites are located within the upstream region -300 to -1. We included these ten regions in the dataset. We do not included a sequence for the remaining gene (HOP1) since there the binding sites are reversed and the URS1H site is placed much further upstream compared to all the other genes in the set.</p>
<p>We designed the same experimental settings as in
[<xref ref-type="bibr" rid="B26">26</xref>], except for the distance gap between the two sites. We chose a larger gap range with respect to
[<xref ref-type="bibr" rid="B26">26</xref>
] in order to approach the problem in a more realistic way, in which information about the binding site being sought may not be known. We look for structured motifs of the form (3,1)-[1,1]-(5,2)-[1,200]-(9,1). We required that structured motifs occurred in at least 7 sequences (quorum <italic>q </italic>
= 70<italic>%</italic>
). SISMA was run with the <italic>space-saving</italic>
 option.</p>
</sec>
<sec><title>UASH-URS1-5 dataset</title>
<p>In the 10 genes dataset discussed above, the two binding sites occur within at most 200 bases. However, as GuhaThakurta and Stormo suggest in
[<xref ref-type="bibr" rid="B13">13</xref>
], that gene sequences can be equally divided in two groups based on the average distance between UASH and URS1H sites. According to this, we obtained a group of 5 genes in which the binding sites are within 50 bases of each other.</p>
<p>We reproduced the experimental setup defined in
[<xref ref-type="bibr" rid="B28">28</xref>]: we analyzed the five sequences in UASH-URS1-5 and looked for dyad motifs of the form (7,1)-[1,50]-(10,2) (again, we actually used a larger distance gap with respect to
[<xref ref-type="bibr" rid="B28">28</xref>
] in order to approach the problem in a more realistic way). Quorum was set to 80%, <italic>i.e.</italic>
, at least 4 sequences.</p>
</sec>
<sec><title>KAR4P dataset</title>
<p>This dataset contains 23 genes of the <italic>Saccharomyces cerevisiae</italic> which are co-regulated by the KAR4p transcription factor required for gene regulation in response to pheromones. We obtained the list of 23 co-regulated genes from the YEASTRACT database
[<xref ref-type="bibr" rid="B41">41</xref>] and the upstream regions of those genes using the RSAT
[<xref ref-type="bibr" rid="B42">42</xref>
] retrieve sequence tool.</p>
<p>We deduced the characteristics of the KAR4p binding site from the consensus given in YEASTRACT and we looked for structured motifs of the form (3,1)-[2,2]-(4,1)-[2,2]-(3,1)-[1,1]-(2,1) occurring in at least 68% of the input sequences, i.e., in at least 16 sequences.</p>
</sec>
<sec><title>Discussion</title>
<p>Figure
<xref ref-type="fig" rid="F5">5</xref> reports the running times (in seconds) for all the four tools, while Table
<xref ref-type="table" rid="T1">1</xref>
 reports the number of simple and structured motifs found for each dataset and the average number of occurrences. We can make the following general observations: </p>
<p>• <sc>Ex</sc>
<sc>Motif</sc>
 terminated computations in all the experiments, contrary to what happened on synthetic datasets. Here, structured motifs are composed of few boxes with few substitutions allowed, experimental conditions extremely favorable to <sc>Ex</sc>
<sc>Motif</sc>
.</p>
<p>• Stage 2 is predominant on SISMA’s running time (see Table
<xref ref-type="table" rid="T1">1</xref>
), in contrast with what we observed on the synthetic datasets, because here (except for UASH-URS1-5) there is a large number of occurrences for each box, making occurrence lists intersection a demanding task.</p>
<p>• SISMA’s stage 2 is characterized by running times which increase with the number of boxes and with the total number of occurrences, coherently with the theoretical bound (see the Methods section).</p>
<fig id="F5" position="float"><label>Figure 5</label>
<caption><p><bold>Running times on biological datasets.</bold>
 Running times (in seconds) of (<bold>a</bold>
) SISMA_<sc>Smile</sc>
 and <sc>Risotto</sc>
 (<bold>b</bold>
) SISMA_<sc>Speller</sc>
 and <sc>Ex</sc>
<sc>Motif</sc>
 on biological datasets, when using an uniprocessor machine with 1GB of RAM.</p>
</caption>
<graphic xlink:href="1748-7188-7-20-5"></graphic>
</fig>
<table-wrap position="float" id="T1"><label>Table 1</label>
<caption><p>Number of simple and structured motifs on biological datasets</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<tbody valign="top"><tr><td colspan="4" align="center" valign="bottom"><bold>UASH-URS1_5</bold>
<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">SISMA_smile<hr></hr>
</td>
<td align="center" valign="bottom">SISMA_speller<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (7,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">1,452 (∼10)<hr></hr>
</td>
<td align="center" valign="bottom">420 (∼10)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (10,2)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">5,472 (∼5)<hr></hr>
</td>
<td align="center" valign="bottom">92 (∼5)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Structured Motif</bold>
<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom">(7,1) - [1,50] - (10,2)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">16,662 (∼5)<hr></hr>
</td>
<td align="center" valign="bottom">14 (∼5)<hr></hr>
</td>
</tr>
<tr><td colspan="4" align="center" valign="bottom"><bold>UASH-URS1_10</bold>
<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">SISMA_smile<hr></hr>
</td>
<td align="center" valign="bottom">SISMA_speller<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (3,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">64 (∼1000)<hr></hr>
</td>
<td align="center" valign="bottom">64 (∼1000)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (5,2)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">1,024 (∼1000)<hr></hr>
</td>
<td align="center" valign="bottom">942 (∼1000)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (9,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">103 (∼10)<hr></hr>
</td>
<td align="center" valign="bottom">55 (∼10)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Structured Motif</bold>
<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom">(3,1) - [ 1,1 ] - (5,2) - [ 1,200 ] - (9,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">2,309,173 (∼ 70)<hr></hr>
</td>
<td align="center" valign="bottom">7,241 (∼ 70)<hr></hr>
</td>
</tr>
<tr><td colspan="4" align="center" valign="bottom"><bold>KAR4P</bold>
<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">SISMA_smile<hr></hr>
</td>
<td align="center" valign="bottom">SISMA_speller<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (3,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">64 (∼2000)<hr></hr>
</td>
<td align="center" valign="bottom">64 (∼2000)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (4,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">256 (∼1000)<hr></hr>
</td>
<td align="center" valign="bottom">256 (∼1000)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Box</bold>
 (2,1)<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">16 (∼6000)<hr></hr>
</td>
<td align="center" valign="bottom">16 (∼6000)<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>Structured Motif</bold>
<hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="center">(3,1) - [2,2]- (4,1) - [2,2] -(3,1) -[1,1] - (2,1)</td>
<td align="center"> </td>
<td align="center">101,750 (∼ 50)</td>
<td align="center">858 (∼ 50)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>The table is divided in three (sub)tables, one for each dataset. The following information apply to each sub-table. There is a row corresponding to each box type involved and one more row corresponding to the type of structured motifs to be found. Also, there is a column for each of the two versions of our SISMA algorithm (SISMA_<sc>Smile</sc>
 and SISMA_<sc>Speller</sc>
). Each cell reports two pieces of information: (1) the number of simple/structured motifs in the input sequences that conform to the given specifications, and (2) the corresponding (approximate) average number of occurrences of each simple/structured motif found.</p>
</table-wrap-foot>
</table-wrap>
<p>The tools under consideration exhibited very different behaviors on the three datasets, so that there is not a clear “overall” winner.</p>
<p>UASH-URS1-5</p>
<p>SISMA_<sc>Smile</sc>
 outperformed <sc>Risotto</sc>
 and SISMA_<sc>Speller</sc>
 outperformed <sc>Ex</sc>
<sc>Motif</sc>
. Running times and differences in running times are really small, meaning that this instance of the problem is not really challenging for any of the tools: SISMA deals with a small number of simple/structured motifs; <sc>Risotto</sc>
 drastically reduces the search space starting form the second box on; <sc>Ex</sc>
<sc>Motif</sc>
 deals with a small number of boxes and allowed substitutions.</p>
<p>UASH-URS1-10</p>
<p>On this dataset SISMA_<sc>Speller</sc>
 outperformed <sc>Ex</sc>
<sc>Motif</sc>
, while SISMA_<sc>Smile</sc>
 without the space-saving option ran out of memory. The reported results refer to SISMA_<sc>Smile</sc>
<italic> with</italic> the option activated (each set
<inline-formula><mml:math id="M8" name="1748-7188-7-20-i8" overflow="scroll"><mml:msub><mml:mrow><mml:mi mathvariant="script">K</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
 is partitioned in at least five subsets). SISMA_<sc>Smile</sc>
 outperformed <sc>Risotto</sc>
<sup>i</sup>
.</p>
<p>This dataset results to be the worst for <sc>Risotto</sc>
, being more affected by the search space size of boxes and the by total number of structured motifs, than by the number of simple motif occurrences.</p>
<p>KAR4P</p>
<p><sc>Risotto</sc>
 and <sc>Ex</sc>
<sc>Motif</sc>
 outperformed SISMA_<sc>Smile</sc>
 and SISMA_<sc>Speller</sc>
, respectively.</p>
<p>SISMA pays a very slow second stage, due to the presence of several thousands of simple motifs occurrences in the input sequences (see Table
<xref ref-type="table" rid="T2">2</xref>
), while <sc>Risotto</sc>
 takes advantage from the fact that the occurrence-paths on the suffix tree were significantly less than actual motif occurrences, and search spaces of boxes quite small. Finally, here <sc>Ex</sc>
<sc>Motif</sc>
 is fast in the phase of neighbor generation, as in this datasets boxes allow at most one error.</p>
<table-wrap position="float" id="T2"><label>Table 2</label>
<caption><p><bold>SISMA</bold>
’s running times on biological datasets</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top"><tr><th align="left"> </th>
<th colspan="2" align="center">SISMA_<sc>Smile</sc>
</th>
<th colspan="2" align="center">SISMA_<sc>Speller</sc>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="center" valign="bottom"> <hr></hr>
</td>
<td align="center" valign="bottom">1<sup><italic>st</italic>
</sup>
 stage<hr></hr>
</td>
<td align="center" valign="bottom">2<sup><italic>nd</italic>
</sup>
 stage<hr></hr>
</td>
<td align="center" valign="bottom">1<sup><italic>st</italic>
</sup>
 stage<hr></hr>
</td>
<td align="center" valign="bottom">2<sup><italic>nd</italic>
</sup>
 stage<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>UASH-URS1_5</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">0.60 sec<hr></hr>
</td>
<td align="center" valign="bottom"><bold>23.28</bold>
 sec<hr></hr>
</td>
<td align="center" valign="bottom"><bold>3.00</bold>
 sec<hr></hr>
</td>
<td align="center" valign="bottom">0.24 sec<hr></hr>
</td>
</tr>
<tr><td align="center" valign="bottom"><bold>UASH-URS1_10</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">0.61 sec<hr></hr>
</td>
<td align="center" valign="bottom"><bold>358.43</bold>
 sec<hr></hr>
</td>
<td align="center" valign="bottom">5.83 sec<hr></hr>
</td>
<td align="center" valign="bottom"><bold>26.70</bold>
 sec<hr></hr>
</td>
</tr>
<tr><td align="center"><bold>KAR4P</bold>
</td>
<td align="center">0.18 sec</td>
<td align="center"><bold>663.04</bold>
 sec</td>
<td align="center">6.20 sec</td>
<td align="center"><bold>114.64</bold>
 sec</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Running times (in seconds) taken by stage 1 and 2 of SISMA_<sc>Smile</sc>
 and SISMA_<sc>Speller</sc>
 on biological datasets, when using an uniprocessor machine with 1GB of RAM. With one exception, stage 2 is always (much) slower than stage 1.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec sec-type="conclusions"><title>Conclusions</title>
<p>Our conclusion is that the 2-stage approach cannot be turned down without due reflection. In this section we present arguments in support of this thesis and some guidelines that may help the user to choose the most efficient approach, depending on the problem instance s/he has to solve. Finally, we discuss possible improvements of SISMA.</p>
<sec><title>Direct vs 2-stage approaches</title>
<p>While implementing and working on SISMA we had the opportunity to reason on the advantages and disadvantages of the two approaches, besides running time. </p>
<p>• <italic>Modularity.</italic>
 The 2-stage approach is clearly more modular, being made of two (possibly completely) distinct software components. This makes implementation and maintenance easier. Possible optimizations and new variants can be implemented on both stages independently. Stage 1 might be optimized with new, more efficient simple motif extraction tools at negligible costs. The tool might be also easily adapted to extract simple motifs using different algorithms (not only exact), obtaining versions of the tool that tackle slightly different problems. Even more, the tool might be enhanced with the possibility for the user to choose the particular software to run in stage 1.Optimizations and variants might not be equally easy to implement under the direct approach, even though any conclusion to this end strictly depends on the particular software under consideration.</p>
<p>• <italic>Parallelization.</italic> SISMA might be easily adapted to efficiently run on a multiprocessor machine. For stage 1 there may be the availability of a parallel version of the tool adopted (such as PSmile
[<xref ref-type="bibr" rid="B43">43</xref>
]), but otherwise simple motif space enumeration could be easily partitioned and mapped on distinct processors. Stage 2 (lists intersection) might be performed simultaneously on distinct processors as well, by using a technique analogous to the space saving option (with distinct processors accessing to distinct portions of data structures to avoid memory collisions).As before, the design of a fast parallel version of a direct enumeration algorithm cannot be guaranteed without taking the algorithm itself (and its logic) into consideration.</p>
<p>• <italic>Exploratory search of structured motifs.</italic>
 The 2-stage approach seems to better adapt to an “exploratory search” utilization for structured motif finding (e.g., through a Web interface). The user might be given the possibility to independently execute the two stages (or even upload the simple motifs occurrences); given the results of the first stage, s/he then might run the second stage several times with different input parameters (say, box orders and distance constraints). This feature might be crucial in real applications, where input parameters are difficult to determine with care.This adaptive use of the tool seems really harder to make in case of direct approach without paying a high price in terms of execution times.</p>
<p>• <italic>Search space reduction.</italic>
 As already pointed out in the paper, since the direct approach looks at the structured motifs as whole, it is able to better handle instances characterized by large size search spaces of some boxes.</p>
</sec>
<sec><title>Direct tool comparison</title>
<p>The following observations might be used to guide the user toward the use of one approach/tool or the other. </p>
<p>• SISMA vs <sc>Risotto</sc>
.</p>
<p>The tests performed show that, when <sc>Risotto</sc>
 is faster than SISMA, one or both of the following conditions occur: (1) boxes have small size search spaces and a small number of simple motifs, (2) boxes with large size search spaces occur near the end (large box index) of the structured motifs. Usually, in all the other cases SISMA is faster. This is particularly evident when the structured motifs being found are composed of just one type (or few types) of boxes.<sc>Risotto</sc>
 fails for time-out (in our tests, 12 hours), SISMA fails for out-of-memory mainly because of the first stage. Hence, according to the available hardware appropriate decisions on which tool to use can be made.<sc>Risotto</sc>
 does not output occurrence positions, but only their number.</p>
<p>• SISMA vs <sc>Ex</sc>
<sc>Motif</sc>
.</p>
<p><sc>Ex</sc>
<sc>Motif</sc>
 terminates computation only for a small number of boxes and, under this circumstance, it is faster than SISMA only when boxes have small length and/or small number of admitted substitutions. It fails in any other more complex situation, making SISMA_<sc>Speller</sc>
 the only available tool for the frequent structured motif discovery problem.</p>
</sec>
<sec><title>SISMA implementation and improvements</title>
<p>We designed and developed a tool for exact structured motif discovery, based on the 2-stage approach. Incorporating simple algorithmic ideas and data structures, SISMA is accurately crafted software which proved to compete very well with other published tools for the same problem. On a comprehensive benchmark (composed of both synthetic and real biological datasets) SISMA exhibited more than acceptable performances, even on a very limited power and memory machine. Running times never exceeded the imposed deadline of 12 hours and altogether the tool failed only on few very difficult problem instances (always due to memory shortage).</p>
<p>We can improve SISMA in some respects. As the experiments clearly show, a crucial issue is the possibly high memory consumption during stage 1, which may cause SISMA to fail. The positive side is that memory critical inputs can in general be detected and appropriate actions be taken. One such action consists simply of automatically activating the space saving option. Another option amounts to interleaving simple motif extraction and list intersection. Also, we plan to implement some simple algorithmic improvements that will help to reduce (to some extent) the search space of simple motifs. For instance, we can eliminate proper prefixes of sequences when extracting specific boxes; more precisely, given box <italic>i</italic>
, we may cut prefixes that are as long as the sum of the lengths of previous boxes, plus the sum of the minimum distances between previous boxes.</p>
</sec>
</sec>
<sec><title>Endnotes</title>
<p><sup>a</sup>Search time is sometimes reduced by further constraining the motif definition, as in Weeder
[<xref ref-type="bibr" rid="B14">14</xref>
].</p>
<p><sup>b</sup>Here we will refer the tool presented in
[<xref ref-type="bibr" rid="B7">7</xref>
] as <sc>Speller</sc>
.</p>
<p><sup>c</sup>
Actually, Ecomp uses an implementation of MITRA-count made by the authors of Ecomp, since MITRA-count itself was, and still is, not available.</p>
<p><sup>d</sup>
Actually, SMILE is a suffix tree-based tool designed to find structured motifs; however it can be used also for simple motif extraction (as a structured motif finder SMILE is outperformed by <sc>Risotto</sc>
).</p>
<p><sup>e</sup>
We discarded the option of implementing a post-processing filter of SMILE output for efficiency reasons, and the option of modifying SMILE code as a more complicated one.</p>
<p><sup>f</sup>
We always have run SISMA with the print option off since <sc>Risotto</sc>
 and <sc>Ex</sc>
<sc>Motif</sc>
 have no possibility to output the starting positions of all occurrences. To be fair, <sc>Ex</sc>
<sc>Motif</sc>
 actually prints the positions of the exact occurrences.</p>
<p><sup>g</sup>
Observe that direct comparison on average running times might be of little significance, as they might vary considerably on input the same value of <italic>b</italic>
 (even resulting in charts that might show a non monotonic behavior).</p>
<p><sup>h</sup>
For <sc>Risotto</sc>
 failure always means “runs beyond the deadline.”</p>
<p><sup>i</sup>
On machines with more main memory the gap between the running times would have been even more favorable to SISMA<sc>Smile</sc>
.</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Authors’ contributions</title>
<p>All the authors contributed equally to the design of the work. Moreover, PV wrote the software, MF supervised the experiments, ML and MM wrote the paper. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
<p><bold>Implementation details.</bold> Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
 (in pdf format) contains details on the basic implementation of SISMA_<sc>Smile</sc>
 and the index box selection variant.</p>
</caption>
<media xlink:href="1748-7188-7-20-S1.pdf" mimetype="application" mime-subtype="pdf"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2"><caption><title>Additional file 2</title>
<p><bold>More experiments on synthetic dataset.</bold> Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
 (in pdf format) includes the results obtained on two more synthetic datasets: one designed as an easier benchmark, one as a particularly hard benchmark for all the tools. Moreover, results are shown for a specific test designed for Risotto, in order to investigate how its performance varies according to boxes order.</p>
</caption>
<media xlink:href="1748-7188-7-20-S2.pdf" mimetype="application" mime-subtype="pdf"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>The authors wish to thank the anonymous reviewers for the constructive criticisms that helped to improve the manuscript, and Roberto Cavicchioli for the work done on a preliminary version of this endeavor.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="other"><name><surname>Watson</surname>
<given-names>JD</given-names>
</name>
<name><surname>Baker</surname>
<given-names>TA</given-names>
</name>
<name><surname>Bell</surname>
<given-names>SP</given-names>
</name>
<name><surname>Gann</surname>
<given-names>A</given-names>
</name>
<name><surname>Levine</surname>
<given-names>M</given-names>
</name>
<name><surname>Losick</surname>
<given-names>R</given-names>
</name>
<source>Molecular Biology of the Gene</source>
<comment>6/e: Pearson International Edition; 2007</comment>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Werner</surname>
<given-names>T</given-names>
</name>
<article-title>Models for prediction and recognition of eukaryotic promoters</article-title>
<source>Mammalian Genome</source>
<year>1999</year>
<volume>10</volume>
<fpage>168</fpage>
<lpage>175</lpage>
<pub-id pub-id-type="doi">10.1007/s003359900963</pub-id>
<pub-id pub-id-type="pmid">9922398</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Sinha</surname>
<given-names>S</given-names>
</name>
<name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>Discovery of novel transcription factor binding sites by statistical overrepresentation</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<fpage>5549</fpage>
<lpage>5560</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkf669</pub-id>
<pub-id pub-id-type="pmid">12490723</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Lemon</surname>
<given-names>B</given-names>
</name>
<name><surname>Tjian</surname>
<given-names>R</given-names>
</name>
<article-title>Orchestrated response: a symphony of transcription factors for gene control</article-title>
<source>Genes & Dev</source>
<year>2000</year>
<volume>14</volume>
<fpage>2551</fpage>
<lpage>2569</lpage>
<pub-id pub-id-type="doi">10.1101/gad.831000</pub-id>
<pub-id pub-id-type="pmid">11040209</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Wray</surname>
<given-names>GA</given-names>
</name>
<article-title>The evolutionary significance of cis-regulatory mutations</article-title>
<source>Nature Rev Genet</source>
<year>2007</year>
<volume>8</volume>
<fpage>206</fpage>
<lpage>216</lpage>
<pub-id pub-id-type="pmid">17304246</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="other"><name><surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name><surname>Elkan</surname>
<given-names>C</given-names>
</name>
<article-title>The Value of Prior Knowledge in Discovering Motifs with MEME</article-title>
<source>Proceedings of 3rd International Conference on Intelligent Systems for Molecular Biology (ISMB ’95)</source>
<year>1995</year>
<fpage>21</fpage>
<lpage>29</lpage>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<article-title>Spelling approximate repeated or common motifs using a suffix tree</article-title>
<source>Lecture Notes Comput Sci</source>
<year>1998</year>
<volume>1380</volume>
<fpage>111</fpage>
<lpage>127</lpage>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="other"><name><surname>Li</surname>
<given-names>M</given-names>
</name>
<name><surname>Ma</surname>
<given-names>B</given-names>
</name>
<name><surname>Wang</surname>
<given-names>L</given-names>
</name>
<article-title>Finding Similar Regions in Many Strings</article-title>
<source>Proceedings of the 31th Annual ACM Symposium on Theory of Computing (STOC ’99)</source>
<year>1999</year>
<fpage>473</fpage>
<lpage>482</lpage>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>Lawrence</surname>
<given-names>CE</given-names>
</name>
<name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Boguski</surname>
<given-names>MS</given-names>
</name>
<name><surname>Liu</surname>
<given-names>JS</given-names>
</name>
<name><surname>Neuwald</surname>
<given-names>AF</given-names>
</name>
<name><surname>Wootton</surname>
<given-names>JC</given-names>
</name>
<article-title>Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment</article-title>
<source>Science</source>
<year>1993</year>
<volume>262</volume>
<fpage>208</fpage>
<lpage>214</lpage>
<pub-id pub-id-type="doi">10.1126/science.8211139</pub-id>
<pub-id pub-id-type="pmid">8211139</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="journal"><name><surname>Brazma</surname>
<given-names>A</given-names>
</name>
<name><surname>Jonassen</surname>
<given-names>I</given-names>
</name>
<name><surname>Eidhammer</surname>
<given-names>I</given-names>
</name>
<name><surname>Gilbert</surname>
<given-names>D</given-names>
</name>
<article-title>Approaches to the Automatic Discovery of Patterns in Biosequences</article-title>
<source>J Comput Biol</source>
<year>1998</year>
<volume>5</volume>
<issue>2</issue>
<fpage>277</fpage>
<lpage>304</lpage>
<comment><ext-link ext-link-type="uri" xlink:href="http://citeseer.ist.psu.edu/article/brazma97approaches.html">http://citeseer.ist.psu.edu/article/brazma97approaches.html</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>van</surname>
<given-names>Helden</given-names>
<suffix>J</suffix>
</name>
<name><surname>André</surname>
<given-names>B</given-names>
</name>
<name><surname>Collado-Vides</surname>
<given-names>J</given-names>
</name>
<article-title>Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies</article-title>
<source>Mol Biol</source>
<year>1998</year>
<volume>281</volume>
<fpage>827</fpage>
<lpage>842</lpage>
<comment><ext-link ext-link-type="uri" xlink:href="http://citeseer.ist.psu.edu/biol02extracting.html">http://citeseer.ist.psu.edu/biol02extracting.html</ext-link>
</comment>
<pub-id pub-id-type="doi">10.1006/jmbi.1998.1947</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="other"><name><surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name><surname>Sze</surname>
<given-names>SH</given-names>
</name>
<article-title>Combinatorial Approaches to Finding Subtle Signals in DNA Sequences</article-title>
<source>Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology (ISMB ’00)</source>
<year>2000</year>
<fpage>269</fpage>
<lpage>278</lpage>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><name><surname>Guha-Thakurta</surname>
<given-names>D</given-names>
</name>
<name><surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<article-title>Identifying target sites for cooperatively binding factors</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>608</fpage>
<lpage>621</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.7.608</pub-id>
<pub-id pub-id-type="pmid">11448879</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Pavesi</surname>
<given-names>G</given-names>
</name>
<name><surname>Mauri</surname>
<given-names>G</given-names>
</name>
<name><surname>Pesole</surname>
<given-names>G</given-names>
</name>
<article-title>An algorithm for finding signals of unknown length in DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>207</fpage>
<lpage>214</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.suppl_1.S207</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="other"><name><surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name><surname>Pevzner</surname>
<given-names>P</given-names>
</name>
<article-title>Finding composite regulatory patterns in DNA sequences</article-title>
<source>Proceedings of the 10th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB ’02)</source>
<year>2002</year>
<fpage>S354</fpage>
<lpage>S363</lpage>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><name><surname>Sinha</surname>
<given-names>S</given-names>
</name>
<name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>3586</fpage>
<lpage>3588</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg618</pub-id>
<pub-id pub-id-type="pmid">12824371</pub-id>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="other"><name><surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<name><surname>Chin</surname>
<given-names>FYL</given-names>
</name>
<article-title>Generalized Planted (l, d)-Motif Problem with Negative Set</article-title>
<source>Proceedings of the Workshop on Algorithms in Bioinformatics (WABI)</source>
<year>2005</year>
<fpage>264</fpage>
<lpage>275</lpage>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Favorov</surname>
<given-names>AV</given-names>
</name>
<name><surname>Gelfand</surname>
<given-names>MS</given-names>
</name>
<name><surname>Gerasimova</surname>
<given-names>AV</given-names>
</name>
<name><surname>Ravcheev</surname>
<given-names>DA</given-names>
</name>
<name><surname>Mironov</surname>
<given-names>AA</given-names>
</name>
<name><surname>Makeev</surname>
<given-names>VJ</given-names>
</name>
<article-title>A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>2240</fpage>
<lpage>2245</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti336</pub-id>
<pub-id pub-id-type="pmid">15728117</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><name><surname>Mendes</surname>
<given-names>N</given-names>
</name>
<name><surname>Casimiro</surname>
<given-names>A</given-names>
</name>
<name><surname>Santos</surname>
<given-names>P</given-names>
</name>
<name><surname>Sá-Correia</surname>
<given-names>I</given-names>
</name>
<name><surname>Oliveira</surname>
<given-names>A</given-names>
</name>
<name><surname>Freitas</surname>
<given-names>A</given-names>
</name>
<article-title>MUSA: a parameter free algorithm for the identification of biologically significant motifs</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>2996</fpage>
<lpage>3002</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl537</pub-id>
<pub-id pub-id-type="pmid">17068086</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>D’haeseleer</surname>
<given-names>P</given-names>
</name>
<article-title>How does DNA sequence motif discovery work?</article-title>
<source>Nat Biotech</source>
<year>2006</year>
<volume>24</volume>
<issue>8</issue>
<fpage>959</fpage>
<lpage>961</lpage>
<comment><ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nbt0806-959">http://dx.doi.org/10.1038/nbt0806-959</ext-link>
</comment>
<pub-id pub-id-type="doi">10.1038/nbt0806-959</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="journal"><name><surname>Das</surname>
<given-names>MK</given-names>
</name>
<name><surname>Dai</surname>
<given-names>HK</given-names>
</name>
<article-title>A survey of dna motif finding algorithms</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>S21</fpage>
<pub-id pub-id-type="pmid">18047721</pub-id>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="journal"><name><surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<name><surname>Hartzell</surname>
<given-names>GW</given-names>
<suffix>III</suffix>
</name>
<article-title>Identifying protein binding sites from unaligned DNA fragments</article-title>
<source>PNAS</source>
<year>1989</year>
<volume>86</volume>
<fpage>1183</fpage>
<lpage>1187</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.86.4.1183</pub-id>
<pub-id pub-id-type="pmid">2919167</pub-id>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><name><surname>Wolfertstetter</surname>
<given-names>F</given-names>
</name>
<name><surname>Frech</surname>
<given-names>K</given-names>
</name>
<name><surname>Herrmann</surname>
<given-names>G</given-names>
</name>
<name><surname>Werner</surname>
<given-names>T</given-names>
</name>
<article-title>Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm</article-title>
<source>Comput Appl Biosci</source>
<year>1996</year>
<volume>12</volume>
<fpage>71</fpage>
<lpage>80</lpage>
<pub-id pub-id-type="pmid">8670622</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="other"><name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>An exact method for finding short motifs in sequences, with application to the ribosome binding site problem</article-title>
<source>Proceedings of 7th International Conference on Intelligent Systems for Molecular Biology (ISMB ’99)</source>
<year>1999</year>
<fpage>262</fpage>
<lpage>271</lpage>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Linhart</surname>
<given-names>C</given-names>
</name>
<name><surname>Halperin</surname>
<given-names>Y</given-names>
</name>
<name><surname>Shamir</surname>
<given-names>R</given-names>
</name>
<article-title>Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>7</issue>
<fpage>1180</fpage>
<lpage>1189</lpage>
<pub-id pub-id-type="doi">10.1101/gr.076117.108</pub-id>
<pub-id pub-id-type="pmid">18411406</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Zaki</surname>
<given-names>MJ</given-names>
</name>
<article-title>EXMOTIF: efficient structured motif extraction</article-title>
<source>Algorithms Mol Biol</source>
<year>2006</year>
<volume>1</volume>
<fpage>21</fpage>
<pub-id pub-id-type="doi">10.1186/1748-7188-1-21</pub-id>
<pub-id pub-id-type="pmid">17109757</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="other"><name><surname>Pisanti</surname>
<given-names>N</given-names>
</name>
<name><surname>Carvalho</surname>
<given-names>A</given-names>
</name>
<name><surname>Marsan</surname>
<given-names>L</given-names>
</name>
<name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<article-title>RISOTTO: Fast extraction of motifs with mismatches</article-title>
<source>Proceedings of the 7th Latin American Theoretical Informatics Symposium</source>
<year>2006</year>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="journal"><name><surname>Zhou</surname>
<given-names>J</given-names>
</name>
<name><surname>Sander</surname>
<given-names>J</given-names>
</name>
<name><surname>Lin</surname>
<given-names>G</given-names>
</name>
<article-title>Efficient composite pattern finding from monad patterns</article-title>
<source>Int J Bioinf Res Appl</source>
<year>2007</year>
<volume>3</volume>
<fpage>86</fpage>
<lpage>99</lpage>
<pub-id pub-id-type="doi">10.1504/IJBRA.2007.011836</pub-id>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
<name><surname>Li</surname>
<given-names>N</given-names>
</name>
<name><surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name><surname>Church</surname>
<given-names>GM</given-names>
</name>
<collab>et. al</collab>
<article-title>Assessingcomputational tools for the discovery of transcription factor binding sites</article-title>
<source>Nature Biotechnol</source>
<year>2005</year>
<volume>23</volume>
<fpage>137</fpage>
<lpage>144</lpage>
<comment><ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/15637633">http://www.ncbi.nlm.nih.gov/pubmed/15637633</ext-link>
</comment>
<pub-id pub-id-type="doi">10.1038/nbt1053</pub-id>
<pub-id pub-id-type="pmid">15637633</pub-id>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><name><surname>McCreight</surname>
<given-names>EM</given-names>
</name>
<article-title>A Space-Economical Suffix Tree Construction Algorithm</article-title>
<source>J ACM</source>
<year>1976</year>
<volume>23</volume>
<issue>2</issue>
<fpage>262</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="doi">10.1145/321941.321946</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="book"><name><surname>Gusfield</surname>
<given-names>D</given-names>
</name>
<source>Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</source>
<year>1997</year>
<publisher-name>New York: Cambridge University Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="journal"><name><surname>Marsan</surname>
<given-names>L</given-names>
</name>
<name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<article-title>Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification</article-title>
<source>J Comput Biol</source>
<year>2000</year>
<volume>7</volume>
<issue>3-4</issue>
<fpage>345</fpage>
<lpage>362</lpage>
<pub-id pub-id-type="doi">10.1089/106652700750050826</pub-id>
<pub-id pub-id-type="pmid">11108467</pub-id>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="other"><name><surname>Carvalho</surname>
<given-names>A</given-names>
</name>
<name><surname>Freitas</surname>
<given-names>A</given-names>
</name>
<name><surname>Oliveira</surname>
<given-names>A</given-names>
</name>
<name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<article-title>A highly scalable algorithm for the extraction of cis-regulatory regions</article-title>
<source>Proceedings of the Asia-Pacific Bioinformatics Conference</source>
<year>2005</year>
<fpage>273</fpage>
<lpage>282</lpage>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="other"><name><surname>Allali</surname>
<given-names>J</given-names>
</name>
<name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<source>The at most k-deep factor tree. Tech</source>
<year>2004</year>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="other"><name><surname>Carvalho</surname>
<given-names>A</given-names>
</name>
<name><surname>Freitas</surname>
<given-names>A</given-names>
</name>
<name><surname>Oliveira</surname>
<given-names>A</given-names>
<suffix>Sagot</suffix>
</name>
<article-title>Efficient Extraction of Structured Motifs Using Box-links</article-title>
<source>Proceedings of 11th Conference on String Processing and Information Retrieval</source>
<year>2004</year>
<fpage>267</fpage>
<lpage>268</lpage>
<comment><ext-link ext-link-type="uri" xlink:href="http://citeseer.ist.psu.edu/viewdoc/summary?">http://citeseer.ist.psu.edu/viewdoc/summary?</ext-link>
doi:</comment>
<comment>10.1.1.102.9439</comment>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="journal"><name><surname>Leung</surname>
<given-names>CM</given-names>
</name>
<name><surname>Chin</surname>
<given-names>FYL</given-names>
</name>
<article-title>Algorithms for Challenging Motif Problems</article-title>
<source>J Bioinf Comput Biol</source>
<year>2006</year>
<volume>4</volume>
<fpage>43</fpage>
<lpage>58</lpage>
<pub-id pub-id-type="doi">10.1142/S0219720006001692</pub-id>
</mixed-citation>
</ref>
<ref id="B37"><mixed-citation publication-type="journal"><name><surname>Buhler</surname>
<given-names>J</given-names>
</name>
<name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>Finding motifs using random projections</article-title>
<source>J Comput Biol</source>
<year>2002</year>
<volume>9</volume>
<fpage>225</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1089/10665270252935430</pub-id>
<pub-id pub-id-type="pmid">12015879</pub-id>
</mixed-citation>
</ref>
<ref id="B38"><mixed-citation publication-type="journal"><name><surname>Davila</surname>
<given-names>J</given-names>
</name>
<name><surname>Balla</surname>
<given-names>S</given-names>
</name>
<name><surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>Fast and Practical Algorithms for Planted (l,d)-Motif Search</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinf (TCBB)</source>
<year>2007</year>
<volume>4</volume>
<issue>4</issue>
<fpage>544</fpage>
<lpage>552</lpage>
</mixed-citation>
</ref>
<ref id="B39"><mixed-citation publication-type="other"><name><surname>Federico</surname>
<given-names>M</given-names>
</name>
<name><surname>Valente</surname>
<given-names>P</given-names>
</name>
<name><surname>Leoncini</surname>
<given-names>M</given-names>
</name>
<name><surname>Montangero</surname>
<given-names>M</given-names>
</name>
<name><surname>Cavicchioli</surname>
<given-names>R</given-names>
</name>
<article-title>An Efficient Algorithm for Planted Structured Motif Extraction</article-title>
<source>CompBio ’09: Proceedings of the 1st ACM Workshop on Breaking Frontiers of Computational Biology</source>
<year>2009</year>
<fpage>1</fpage>
<lpage>6</lpage>
</mixed-citation>
</ref>
<ref id="B40"><mixed-citation publication-type="journal"><name><surname>Zhu</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>M</given-names>
</name>
<article-title>SCPD: a promoter database of the yeast Saccharomyces cerevisiae</article-title>
<source>Bioinformatics</source>
<year>1999</year>
<volume>15</volume>
<fpage>607</fpage>
<lpage>611</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/15.7.607</pub-id>
<pub-id pub-id-type="pmid">10487868</pub-id>
</mixed-citation>
</ref>
<ref id="B41"><mixed-citation publication-type="journal"><name><surname>Teixeira</surname>
<given-names>MC</given-names>
</name>
<name><surname>Monteiro</surname>
<given-names>P</given-names>
</name>
<name><surname>Jain</surname>
<given-names>P</given-names>
</name>
<name><surname>Tenreiro</surname>
<given-names>S</given-names>
</name>
<name><surname>Fernandes</surname>
<given-names>AR</given-names>
</name>
<name><surname>Mira</surname>
<given-names>NP</given-names>
</name>
<name><surname>Alenquer</surname>
<given-names>M</given-names>
</name>
<name><surname>Freitas</surname>
<given-names>AT</given-names>
</name>
<name><surname>Oliveira</surname>
<given-names>AL</given-names>
</name>
<name><surname>Sá-Correia</surname>
<given-names>I</given-names>
</name>
<article-title>The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>D446</fpage>
<lpage>D451</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkj013</pub-id>
<pub-id pub-id-type="pmid">16381908</pub-id>
</mixed-citation>
</ref>
<ref id="B42"><mixed-citation publication-type="journal"><name><surname>Thomas-Chollier</surname>
<given-names>M</given-names>
</name>
<name><surname>Sand</surname>
<given-names>O</given-names>
</name>
<name><surname>Turatsinze</surname>
<given-names>JV</given-names>
</name>
<name><surname>Janky</surname>
<given-names>R</given-names>
</name>
<name><surname>Defrance</surname>
<given-names>M</given-names>
</name>
<name><surname>Vervisch</surname>
<given-names>E</given-names>
</name>
<name><surname>Brohee</surname>
<given-names>S</given-names>
<suffix>van Helden</suffix>
</name>
<article-title>RSAT: regulatory sequence analysis tools</article-title>
<source>Nucleic Acids Res</source>
<year>2008</year>
<volume>36</volume>
<fpage>W119</fpage>
<lpage>W127</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn304</pub-id>
<pub-id pub-id-type="pmid">18495751</pub-id>
</mixed-citation>
</ref>
<ref id="B43"><mixed-citation publication-type="other"><name><surname>Carvalho</surname>
<given-names>AM</given-names>
</name>
<name><surname>Freitas</surname>
<given-names>AT</given-names>
</name>
<name><surname>Oliveira</surname>
<given-names>AL</given-names>
</name>
<name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<article-title>A parallel algorithm for the extraction of structured motifs</article-title>
<source>Proceedings of the 19th ACM Symposium on Applied Computing (SAC’04)</source>
<year>2004</year>
<fpage>147</fpage>
<lpage>153</lpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/TelematiV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0003249 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0003249 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    TelematiV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.31.
Data generation: Thu Nov 2 16:09:04 2017. Site generation: Sun Mar 10 16:42:28 2024

	Serveur d'exploration sur la télématique
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration sur la télématique

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri