Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000946 ( Pmc/Corpus ); précédent : 0009459; suivant : 0009470 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Efficient algorithms for biological stems search</title>
<author>
<name sortKey="Mi, Tian" sort="Mi, Tian" uniqKey="Mi T" first="Tian" last="Mi">Tian Mi</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rajasekaran, Sanguthevar" sort="Rajasekaran, Sanguthevar" uniqKey="Rajasekaran S" first="Sanguthevar" last="Rajasekaran">Sanguthevar Rajasekaran</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23679045</idno>
<idno type="pmc">3679804</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3679804</idno>
<idno type="RBID">PMC:3679804</idno>
<idno type="doi">10.1186/1471-2105-14-161</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000946</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000946</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Efficient algorithms for biological stems search</title>
<author>
<name sortKey="Mi, Tian" sort="Mi, Tian" uniqKey="Mi T" first="Tian" last="Mi">Tian Mi</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rajasekaran, Sanguthevar" sort="Rajasekaran, Sanguthevar" uniqKey="Rajasekaran S" first="Sanguthevar" last="Rajasekaran">Sanguthevar Rajasekaran</name>
<affiliation>
<nlm:aff id="I1">Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (
<italic>l, d</italic>
)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the
<italic>Motif Stems Search (MSS)</italic>
problem. A motif stem is an
<italic>l</italic>
-mer (for some relevant value of
<italic>l</italic>
)with some wildcard characters and hence corresponds to a set of
<italic>l</italic>
-mers (without wildcards), some of which are (
<italic>l, d</italic>
)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Mi, T" uniqKey="Mi T">T Mi</name>
</author>
<author>
<name sortKey="Merlin, Jc" uniqKey="Merlin J">JC Merlin</name>
</author>
<author>
<name sortKey="Deverasetty, S" uniqKey="Deverasetty S">S Deverasetty</name>
</author>
<author>
<name sortKey="Gryk, Mr" uniqKey="Gryk M">MR Gryk</name>
</author>
<author>
<name sortKey="Bill, Tj" uniqKey="Bill T">TJ Bill</name>
</author>
<author>
<name sortKey="Brooks, Aw" uniqKey="Brooks A">AW Brooks</name>
</author>
<author>
<name sortKey="Lee, Ly" uniqKey="Lee L">LY Lee</name>
</author>
<author>
<name sortKey="Rathnayake, V" uniqKey="Rathnayake V">V Rathnayake</name>
</author>
<author>
<name sortKey="Ross, Ca" uniqKey="Ross C">CA Ross</name>
</author>
<author>
<name sortKey="Sargeant, Dp" uniqKey="Sargeant D">DP Sargeant</name>
</author>
<author>
<name sortKey="Strong, Cl" uniqKey="Strong C">CL Strong</name>
</author>
<author>
<name sortKey="Watts, P" uniqKey="Watts P">P Watts</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Schiller, Mr" uniqKey="Schiller M">MR Schiller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gould, Cm" uniqKey="Gould C">CM Gould</name>
</author>
<author>
<name sortKey="Diella, F" uniqKey="Diella F">F Diella</name>
</author>
<author>
<name sortKey="Via, A" uniqKey="Via A">A Via</name>
</author>
<author>
<name sortKey="Puntervoll, P" uniqKey="Puntervoll P">P Puntervoll</name>
</author>
<author>
<name sortKey="Gemund, C" uniqKey="Gemund C">C Gemund</name>
</author>
<author>
<name sortKey="Chabanis Davidson, S" uniqKey="Chabanis Davidson S">S Chabanis-Davidson</name>
</author>
<author>
<name sortKey="Michael, S" uniqKey="Michael S">S Michael</name>
</author>
<author>
<name sortKey="Sayadi, A" uniqKey="Sayadi A">A Sayadi</name>
</author>
<author>
<name sortKey="Bryne, Jc" uniqKey="Bryne J">JC Bryne</name>
</author>
<author>
<name sortKey="Chica, C" uniqKey="Chica C">C Chica</name>
</author>
<author>
<name sortKey="Seiler, M" uniqKey="Seiler M">M Seiler</name>
</author>
<author>
<name sortKey="Davey, Ne" uniqKey="Davey N">NE Davey</name>
</author>
<author>
<name sortKey="Haslam, Nj" uniqKey="Haslam N">NJ Haslam</name>
</author>
<author>
<name sortKey="Weatheritt, Rj" uniqKey="Weatheritt R">RJ Weatheritt</name>
</author>
<author>
<name sortKey="Budd, A" uniqKey="Budd A">A Budd</name>
</author>
<author>
<name sortKey="Hughes, T" uniqKey="Hughes T">T Hughes</name>
</author>
<author>
<name sortKey="Pas, J" uniqKey="Pas J">J Pas</name>
</author>
<author>
<name sortKey="Rychlewski, L" uniqKey="Rychlewski L">L Rychlewski</name>
</author>
<author>
<name sortKey="Trave, G" uniqKey="Trave G">G Trave</name>
</author>
<author>
<name sortKey="Aasland, R" uniqKey="Aasland R">R Aasland</name>
</author>
<author>
<name sortKey="Helmer Citterich, M" uniqKey="Helmer Citterich M">M Helmer-Citterich</name>
</author>
<author>
<name sortKey="Linding, R" uniqKey="Linding R">R Linding</name>
</author>
<author>
<name sortKey="Gibson, Tj" uniqKey="Gibson T">TJ Gibson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Obenauer, Jc" uniqKey="Obenauer J">JC Obenauer</name>
</author>
<author>
<name sortKey="Cantley, Lc" uniqKey="Cantley L">LC Cantley</name>
</author>
<author>
<name sortKey="Yaffe, Mb" uniqKey="Yaffe M">MB Yaffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Hoi Sze, S" uniqKey="Hoi Sze S">S hoi Sze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Price, A" uniqKey="Price A">A Price</name>
</author>
<author>
<name sortKey="Ramabhadran, S" uniqKey="Ramabhadran S">S Ramabhadran</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buhler, J" uniqKey="Buhler J">J Buhler</name>
</author>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author>
<name sortKey="Elkan, C" uniqKey="Elkan C">C Elkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lawrence, Ce" uniqKey="Lawrence C">CE Lawrence</name>
</author>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Boguski, Ms" uniqKey="Boguski M">MS Boguski</name>
</author>
<author>
<name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
<author>
<name sortKey="Neuwald, Af" uniqKey="Neuwald A">AF Neuwald</name>
</author>
<author>
<name sortKey="Wootton, Jc" uniqKey="Wootton J">JC Wootton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rocke, E" uniqKey="Rocke E">E Rocke</name>
</author>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Keich, U" uniqKey="Keich U">U Keich</name>
</author>
<author>
<name sortKey="Pevzner, P" uniqKey="Pevzner P">P Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Timothy, Lb" uniqKey="Timothy L">LB Timothy</name>
</author>
<author>
<name sortKey="Elkan, C" uniqKey="Elkan C">C Elkan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hertz, Gz" uniqKey="Hertz G">GZ Hertz</name>
</author>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pisanti, N" uniqKey="Pisanti N">N Pisanti</name>
</author>
<author>
<name sortKey="Carvalho, A" uniqKey="Carvalho A">A Carvalho</name>
</author>
<author>
<name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author>
<name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Hsi Huang, C" uniqKey="Hsi Huang C">C hsi Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chin, Fyl" uniqKey="Chin F">FYL Chin</name>
</author>
<author>
<name sortKey="Leung, Hcm" uniqKey="Leung H">HCM Leung</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Dinh, H" uniqKey="Dinh H">H Dinh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dinh, H" uniqKey="Dinh H">H Dinh</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Kundeti, V" uniqKey="Kundeti V">V Kundeti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bandyopadhyay, S" uniqKey="Bandyopadhyay S">S Bandyopadhyay</name>
</author>
<author>
<name sortKey="Sahni, S" uniqKey="Sahni S">S Sahni</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23679045</article-id>
<article-id pub-id-type="pmc">3679804</article-id>
<article-id pub-id-type="publisher-id">1471-2105-14-161</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-14-161</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Efficient algorithms for biological stems search</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Mi</surname>
<given-names>Tian</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>tian.mi@engr.uconn.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A2">
<name>
<surname>Rajasekaran</surname>
<given-names>Sanguthevar</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>rajasek@engr.uconn.edu</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA</aff>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>16</day>
<month>5</month>
<year>2013</year>
</pub-date>
<volume>14</volume>
<fpage>161</fpage>
<lpage>161</lpage>
<history>
<date date-type="received">
<day>30</day>
<month>7</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>6</day>
<month>5</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2013 Mi and Rajasekaran; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2013</copyright-year>
<copyright-holder>Mi and Rajasekaran; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/14/161"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Motifs are significant patterns in DNA, RNA, and protein sequences, which play an important role in biological processes and functions, like identification of open reading frames, RNA transcription, protein binding, etc. Several versions of the motif search problem have been studied in the literature. One such version is called the Planted Motif Search (PMS)or (
<italic>l, d</italic>
)-motif Search. PMS is known to be NP complete. The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size. Recently a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version as the
<italic>Motif Stems Search (MSS)</italic>
problem. A motif stem is an
<italic>l</italic>
-mer (for some relevant value of
<italic>l</italic>
)with some wildcard characters and hence corresponds to a set of
<italic>l</italic>
-mers (without wildcards), some of which are (
<italic>l, d</italic>
)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.</p>
</sec>
<sec>
<title>Results</title>
<p>In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of the run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller)subset of the stems output by Kuksa and Pavlovic’s algorithm.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Motifs, or sequence motifs, are patterns of nucleotides or amino acids. Motifs are often related to primer selection, transcription factor binding sites, mRNA processing, transcription termination, etc. Sequence motifs of proteins are typically involved in functions such as binding to a target protein, protein trafficking, post-translational modifications, and so on. Motif search problem has been studied extensively due to its pivotal biological significance. Several types of algorithms have been proposed for motif search. In one such class of methods, putative motifs in an input biological query sequence are predicted based on a database of known motifs. Examples include [
<xref ref-type="bibr" rid="B1">1</xref>
-
<xref ref-type="bibr" rid="B3">3</xref>
]. In another class of methods, motifs are assumed to appear frequently in a set of sequences, like the same protein sequence from different species. Here the problem of motif search is reduced to that of finding subsequences that occur in many of the input sequences. Planted motif search (PMS)is one such formulation.</p>
<p>Numerous algorithms have been proposed to solve the PMS problem. The WINNOWER algorithm uses a graph to transform this problem into one of finding large cliques in the graph [
<xref ref-type="bibr" rid="B4">4</xref>
]. The PatternBranching algorithm introduces a scoring technique for all the motif candidates [
<xref ref-type="bibr" rid="B5">5</xref>
]. The PROJECTION algorithm repeatedly picks several random positions and uses a hash table with a threshold to limit the motif candidates [
<xref ref-type="bibr" rid="B6">6</xref>
]. Bailey 1995 [
<xref ref-type="bibr" rid="B7">7</xref>
] employs expectation maximization algorithms while Gibbs sampling is used in [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
]. MULTIPROFILER [
<xref ref-type="bibr" rid="B10">10</xref>
], MEME [
<xref ref-type="bibr" rid="B11">11</xref>
], and CONSENSUS [
<xref ref-type="bibr" rid="B12">12</xref>
] are also known PMS algorithms.</p>
<p>An
<italic>exact</italic>
PMS algorithm always outputs all the motifs present in a given set of sequences. MITRA employs a mismatch tree structure to generate the motif candidates efficiently [
<xref ref-type="bibr" rid="B13">13</xref>
]. RISOTTO constructs a suffix tree to compare sequences [
<xref ref-type="bibr" rid="B14">14</xref>
]. PMS1 [
<xref ref-type="bibr" rid="B15">15</xref>
] considers all the motif candidates and evaluates them using sorting. Voting uses a hash table to locate the motifs [
<xref ref-type="bibr" rid="B16">16</xref>
]. PMS2 is based on PMS1 and it extends smaller motifs to get longer motifs, and PMS3 forms a motif of a desired length using two smaller motifs [
<xref ref-type="bibr" rid="B15">15</xref>
]. PMSPrune introduces a tree structure for the motif candidates and uses a branch-and-bound algorithm to reduce the search space [
<xref ref-type="bibr" rid="B17">17</xref>
]. PMS4 is a speed-up technique that finds a superset of the motifs using a subset of the input sequences and validates those candidates [
<xref ref-type="bibr" rid="B18">18</xref>
]. PMS5 employs an Integer Linear Programming (ILP)as the branch-bound algorithm for a speedup [
<xref ref-type="bibr" rid="B19">19</xref>
]. PMS6 uses the solutions of such ILPs to generate motif candidates [
<xref ref-type="bibr" rid="B20">20</xref>
].</p>
<p>Most of the work on exact algorithms for PMS has focussed on DNA or RNA sequences where |
<italic>Σ</italic>
|=4. Little work has been done on larger alphabet sizes, such as on proteins. A recent work focusses on protein sequences [
<xref ref-type="bibr" rid="B21">21</xref>
]. However, the stemming algorithm proposed in this paper does not solve the PMS problem. In particular, it does not find motifs but rather
<italic>motif stems</italic>
. A motif stem (denoted as stem from hereon)can be thought of as an
<italic>l</italic>
-mer (for some relevant value of
<italic>l</italic>
)with some wild cards present in it. As a result, a stem stands for a set of motifs. A stemming algorithm generates stems (or motifs with wildcards)to represent motifs for large-alphabet inputs [
<xref ref-type="bibr" rid="B21">21</xref>
]. The stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] generates a very large set of stems (and hence a very large superset of motifs). In this paper we propose two algorithms for Motif Stems Search, MSS1 and MSS2, which outperform the stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
]. The new algorithms generate a much smaller set of stems. The stems generated by the algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] as well as MSS1 and MSS2 are guaranteed to be supersets of all the motifs present in a given set of input sequences.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Motif search on large alphabets</title>
<p>In this section we provide some definitions pertinent to PMS and MSS problems.</p>
<sec>
<title>Definition 1</title>
<p>A sequence
<italic>x</italic>
=
<italic>x</italic>
[1,2,…,
<italic>l</italic>
] (|
<italic>x</italic>
|=
<italic>l</italic>
)on
<italic>Σ</italic>
(
<italic>x</italic>
[
<italic>i</italic>
]∈
<italic>Σ</italic>
, 1≤
<italic>i</italic>
<italic>l</italic>
)is an
<italic>l</italic>
-mer.</p>
</sec>
<sec>
<title>Definition 2</title>
<p>Given two
<italic>l</italic>
-mers
<italic>x</italic>
and
<italic>y</italic>
, the Hamming distance between two
<italic>l</italic>
-mer is defined as:</p>
<p>
<disp-formula>
<mml:math id="M1" name="1471-2105-14-161-i1" overflow="scroll">
<mml:mrow>
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>|</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo></mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
</sec>
<sec>
<title>Definition 3</title>
<p>Given an
<italic>l</italic>
mer
<italic>x</italic>
and a sequence
<italic>s</italic>
of length
<italic>m</italic>
, the Hamming distance between
<italic>x</italic>
and
<italic>s</italic>
is defined as</p>
<p>
<disp-formula>
<mml:math id="M2" name="1471-2105-14-161-i2" overflow="scroll">
<mml:mrow>
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>min</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mi>m</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munder>
<mml:mo>{</mml:mo>
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
</sec>
<sec>
<title>Definition 4</title>
<p>
<bold>(Planted Motif Search (PMS)Problem).</bold>
Let
<italic>S</italic>
be a set of
<italic>n</italic>
sequences of length
<italic>m</italic>
each on an alphabet
<italic>Σ</italic>
. Specifically, let
<italic>S</italic>
={
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
||
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
|=
<italic>m</italic>
, 1≤
<italic>i</italic>
<italic>n</italic>
}. The
<italic>planted motif search problem</italic>
or
<italic>(l,d)motif search problem</italic>
takes as input
<italic>S</italic>
and two integers
<italic>l</italic>
and
<italic>d</italic>
, and finds every string
<italic>x</italic>
of length
<italic>l</italic>
such that for every
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
, the Hamming distance between
<italic>x</italic>
and
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
is no more than
<italic>d</italic>
. In particular, we want to compute the following set:</p>
<p>
<disp-formula>
<mml:math id="M3" name="1471-2105-14-161-i3" overflow="scroll">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>|</mml:mo>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="1pt"></mml:mspace>
<mml:mtext>for</mml:mtext>
<mml:mspace width="3pt"></mml:mspace>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>}</mml:mo>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Any such
<italic>x</italic>
is called an (
<italic>l,d</italic>
)-motif. Any
<italic>l</italic>
-mer of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
that is at a Hamming distance of ≤
<italic>d</italic>
from
<italic>x</italic>
is called
<italic>an occurrence or instance of x</italic>
.</p>
</sec>
<sec>
<title>Definition 5</title>
<p>Given an
<italic>l</italic>
-mer
<italic>x</italic>
, the
<italic>d</italic>
-neighbors of
<italic>x</italic>
is defined to be {
<italic>y</italic>
: |
<italic>y</italic>
|=
<italic>l</italic>
and
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>y</italic>
)≤
<italic>d</italic>
}. The
<italic>d</italic>
-neighbors of
<italic>x</italic>
in any sequence
<italic>s</italic>
is defined to be the intersection of the
<italic>d</italic>
-neighbors of
<italic>x</italic>
and the set of
<italic>l</italic>
-mers in
<italic>s</italic>
.</p>
</sec>
<sec>
<title>Observation 1</title>
<p>If the Hamming distance between two
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
is larger than 2
<italic>d</italic>
(i.e.,
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)≥2
<italic>d</italic>
)then no
<italic>l</italic>
-mer
<italic>x</italic>
<sub>3</sub>
exists such that
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>3</sub>
)≤
<italic>d</italic>
and
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>3</sub>
)≤
<italic>d</italic>
.</p>
<p>PMS algorithms are typically tested on random data with
<italic>n</italic>
=20 and
<italic>m</italic>
=600. Each input string is randomly generated such that each symbol in each string is equally likely to be any character from the alphabet. A motif is generated randomly. Randomly mutated versions of this motif are planted in the input strings (one mutated motif per string). For a given value of
<italic>l</italic>
, we call the pair (
<italic>l,d</italic>
)a
<italic>challenging instance</italic>
if
<italic>d</italic>
is the smallest value for which the number of (
<italic>l,d</italic>
)-motifs occurring in the input strings by random chance is ≥1. Some of the challenging instances are: (9, 2), (11, 3), (13, 4), (15, 5), (17, 6), (19, 7), and so on. One of the performance measures of interest for any exact algorithm is the largest challenging instance that it can solve. MITRA can solve the instance of (13, 4)[
<xref ref-type="bibr" rid="B13">13</xref>
], and RISOTTO [
<xref ref-type="bibr" rid="B14">14</xref>
] and Voting [
<xref ref-type="bibr" rid="B16">16</xref>
] successfully run on (15, 5). PMSPrune solves up to (19, 7)[
<xref ref-type="bibr" rid="B17">17</xref>
]. PMS5 [
<xref ref-type="bibr" rid="B19">19</xref>
] and PMS6 [
<xref ref-type="bibr" rid="B20">20</xref>
] can handle (23, 9). These statistics are based on DNA sequences where |
<italic>Σ</italic>
|=4.</p>
<p>The time complexities of exact algorithms typically depend exponentially on the size of
<italic>Σ</italic>
. PMS0 takes
<inline-formula>
<mml:math id="M4" name="1471-2105-14-161-i4" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext mathvariant="italic">nl</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
time, and PMS1 costs
<inline-formula>
<mml:math id="M5" name="1471-2105-14-161-i5" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="italic">mn</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfrac>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
time where
<italic>w</italic>
is the word length of the computer [
<xref ref-type="bibr" rid="B15">15</xref>
]. It needs
<inline-formula>
<mml:math id="M6" name="1471-2105-14-161-i6" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>m</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
time for RISOTTO [
<xref ref-type="bibr" rid="B14">14</xref>
],
<inline-formula>
<mml:math id="M7" name="1471-2105-14-161-i7" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="italic">mn</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
for Voting [
<xref ref-type="bibr" rid="B16">16</xref>
], and
<inline-formula>
<mml:math id="M8" name="1471-2105-14-161-i8" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi>n</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
for PMSPrune [
<xref ref-type="bibr" rid="B17">17</xref>
].</p>
<p>When the size of the alphabet is large (e.g., |
<italic>Σ</italic>
|=20 for proteins), the above exact algorithms will take a very long time. Kuksa and Pavlovic have introduced a new version of the motif search problem and proposed an efficient algorithm to solve it on large alphabets. A motif
<italic>stem</italic>
is an
<italic>l</italic>
-mer with wildcards. Thus a stem represents a set of
<italic>l</italic>
-mers without wildcards. For example, if
<italic>g</italic>
<italic>a</italic>
<italic>c</italic>
<italic>c</italic>
is a DNA stem, it represents the following 5-mers without wildcards:
<italic>g</italic>
<italic>g</italic>
<italic>a</italic>
<italic>a</italic>
<italic>c</italic>
,
<italic>g</italic>
<italic>c</italic>
<italic>a</italic>
<italic>a</italic>
<italic>c</italic>
,
<italic>g</italic>
<italic>t</italic>
<italic>a</italic>
<italic>a</italic>
<italic>c</italic>
, and
<italic>gaaac</italic>
. Given a set of strings from some alphabet, the problem of finding motif stems in them is known as the
<italic>Motif Stem Search (MSS)</italic>
problem. We focus on MSS in this paper.</p>
</sec>
<sec>
<title>Definition 6</title>
<p>Motif Stem Search (MSS)Problem. Input are
<italic>N</italic>
sequences and two integers
<italic>l</italic>
and
<italic>d</italic>
. The problem is to find a set of stems such that the set of
<italic>l</italic>
-mers represented by these stems is a superset of all the (
<italic>l,d</italic>
)-motifs present in the
<italic>N</italic>
sequences.</p>
<p>As stated above, there are many possible solutions to the MSS problem. The challenge then is to come up with a superset as small as possible which covers all the (
<italic>l,d</italic>
)-motifs. In other words, we want the number of
<italic>l</italic>
-mers (without wildcards)represented by the stems to be as small as possible.</p>
</sec>
<sec>
<title>MSS1 - A basic Algorithm</title>
<p>Based on OBSERVATION 1, if the Hamming distance between an
<italic>l</italic>
-mer
<italic>x</italic>
and a sequence
<italic>s</italic>
is larger than 2
<italic>d</italic>
, then no
<italic>l</italic>
-mer
<italic>x</italic>
’ exists such that
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>x</italic>
<sup></sup>
)≤
<italic>d</italic>
and
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sup></sup>
,
<italic>s</italic>
)≤
<italic>d</italic>
. This leads us to the following observation.</p>
<sec>
<title>Observation 2</title>
<p>Given an
<italic>l</italic>
-mer
<italic>x</italic>
, if ∃
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
such that
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
)>2
<italic>d</italic>
, then none of
<italic>x</italic>
’s
<italic>d</italic>
-neighbors can be a motif.</p>
<p>The stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] works as follows. It makes use of OBSERVATION 2 crucially. OBSERVATION 2 states that an
<italic>l</italic>
-mer
<italic>x</italic>
in any input string cannot be an instance of an (
<italic>l,d</italic>
)-motif if there exists at least one input string
<italic>s</italic>
such that
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>s</italic>
)>2
<italic>d</italic>
. The algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] first identifies a set
<italic>I</italic>
of possible motif instances. An
<italic>l</italic>
-mer
<italic>x</italic>
in any input string
<italic>s</italic>
will be included in
<italic>I</italic>
if and only if
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>s</italic>
<sup></sup>
)≤2
<italic>d</italic>
for every input string
<italic>s</italic>
’. Having found such a set
<italic>I</italic>
, the algorithm then uses
<italic>I</italic>
to generate stems. The stems are found as follows: For every
<italic>x</italic>
,
<italic>y</italic>
<italic>I</italic>
, the algorithm generates the common
<italic>d</italic>
-neighbors of
<italic>x</italic>
and
<italic>y</italic>
as stems. The union of all such stems will constitute candidate motif stems. This union is a superset of the motif stems. Finally, for each candidate stem, the algorithm checks if this is a correct answer or not. All valid stems that pass this test are output.</p>
<p>In Algorithm 1 and Algorithm 2 we present a faster algorithm (than that of [
<xref ref-type="bibr" rid="B21">21</xref>
])for generating motif stems. In Algorithm 1 we present an algorithm for generating the set
<italic>I</italic>
. The cardinality of
<italic>I</italic>
that we generate is a much smaller subset of the
<italic>I</italic>
generated in the stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
]. For any pair of
<italic>l</italic>
-mers (
<italic>x</italic>
,
<italic>x</italic>
<sup></sup>
)in the set
<italic>I</italic>
, we begin with
<italic>x</italic>
and replace some characters in
<italic>x</italic>
with wildcards to generate MSS candidates. The positions in which
<italic>x</italic>
and
<italic>x</italic>
’ match are referred to as the
<italic>matching region</italic>
and the positions in which
<italic>x</italic>
and
<italic>x</italic>
’ differ are referred to as the
<italic>non-matching region</italic>
. The wildcards can be placed in the matching region and/or the non-matching region. Any stem
<italic>t</italic>
is generated by placing wildcards in
<italic>x</italic>
. Therefore, wildcards in the generated stem
<italic>t</italic>
are always treated as mismatches between
<italic>t</italic>
and
<italic>x</italic>
, independent of whether they are in the matching or the non-matching region. However, for
<italic>x</italic>
’, in the non-matching region, wildcards in the generated stem
<italic>t</italic>
are assumed to be matches between
<italic>t</italic>
and
<italic>x</italic>
’ while in the matching region they are still treated as mismatches between
<italic>t</italic>
and
<italic>x</italic>
’. The number of wildcards is dependent on the Hamming distance between
<italic>x</italic>
and
<italic>x</italic>
’ and
<italic>d</italic>
. Let
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
,
<italic>x</italic>
<sup></sup>
)=
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
. Table
<xref ref-type="table" rid="T1">1</xref>
shows how many wildcards should be placed in different cases.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Numbers of wildcards</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left"> </th>
<th align="center">
<bold>Non-matching region</bold>
</th>
<th align="center">
<bold>Matching region</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>d</italic>
<hr></hr>
</td>
<td align="center" valign="bottom">0≤
<italic>i</italic>
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<hr></hr>
</td>
<td align="center" valign="bottom">
<italic>d</italic>
−max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
>
<italic>d</italic>
</td>
<td align="center">
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>d</italic>
<italic>i</italic>
<italic>d</italic>
</td>
<td align="center">
<italic>d</italic>
−max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Assume that
<italic>i</italic>
wildcards are placed in the non-matching region of
<italic>x</italic>
to form
<italic>t</italic>
, resulting in
<italic>i</italic>
mismatches between
<italic>t</italic>
and
<italic>x</italic>
and (
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)mismatches between
<italic>t</italic>
and
<italic>x</italic>
’. We consider the following two cases: </p>
<p>1.
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>d</italic>
: The number of wildcards
<italic>i</italic>
can vary from 0 to the size of the non-matching region. To make the total number of mismatches against
<italic>x</italic>
smaller than
<italic>d</italic>
, at most
<italic>d</italic>
<italic>i</italic>
wildcards can be placed in the matching region of
<italic>x</italic>
. Similarly, to make the total number of mismatches against
<italic>x</italic>
’ smaller than
<italic>d</italic>
, at most
<italic>d</italic>
−(
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)wildcards can be placed in the matching region of
<italic>x</italic>
’.</p>
<p>2.
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
>
<italic>d</italic>
: At least
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>d</italic>
wildcards have to be placed in the non-matching region to eliminate the mismatches. Similar to case 1, in the matching region, at most
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)wildcards can be placed.</p>
</sec>
<sec>
<title>Algorithm 1
<bold>MSS1</bold>
</title>
<p>In the matching region,
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)is an upper bound on the number of wildcards. However, it is not necessary to enumerate all the cases from 0 to
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
). Similarly, it is not necessary to repeat stems generation for all pairs in
<italic>I</italic>
. For any
<italic>x</italic>
let
<inline-formula>
<mml:math id="M9" name="1471-2105-14-161-i10" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo></mml:mo>
<mml:mi>I</mml:mi>
</mml:math>
</inline-formula>
be
<italic>x</italic>
’s 2
<italic>d</italic>
-neighbors in sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
(i.e.,
<inline-formula>
<mml:math id="M10" name="1471-2105-14-161-i11" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>}</mml:mo>
</mml:math>
</inline-formula>
)and let
<italic>O</italic>
<sub>
<italic>i</italic>
</sub>
be the set of motif instances in
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. Then, clearly,
<italic>O</italic>
<sub>
<italic>i</italic>
</sub>
<italic>I</italic>
<sub>
<italic>i</italic>
</sub>
. The motifs form a subset of stems that can be obtained between
<italic>x</italic>
and each of
<italic>O</italic>
<sub>
<italic>i</italic>
</sub>
. The motif stems we generate are stems that can be obtained from
<italic>l</italic>
-mer pairs of
<inline-formula>
<mml:math id="M11" name="1471-2105-14-161-i12" overflow="scroll">
<mml:mo>(</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
. To minimize the number of stems generated from
<italic>I</italic>
, we have to use the sequence with the smallest
<italic>j</italic>
.</p>
</sec>
<sec>
<title>Observation 3</title>
<p>For any
<italic>l</italic>
-mer
<italic>x</italic>
, let its 2
<italic>d</italic>
-neighbors in sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
be
<inline-formula>
<mml:math id="M12" name="1471-2105-14-161-i13" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>
(for 1≤
<italic>i</italic>
<italic>n</italic>
). Then, the (
<italic>l,d</italic>
)-motifs are included in the stems set, which is generated from the pairs
<inline-formula>
<mml:math id="M13" name="1471-2105-14-161-i14" overflow="scroll">
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
.</p>
<p>The detailed MSS1 algorithm is given in Algorithm 1 and Algorithm 2.</p>
<p>In lines 2-18 we find the sequence in which
<italic>x</italic>
has the minimum number of 2
<italic>d</italic>
-neighbors. Also, if one sequence has no 2
<italic>d</italic>
-neighbor, the current
<italic>l</italic>
-mer
<italic>x</italic>
is skipped (line 12). The stems are generated by placing wildcards in each pair of
<italic>x</italic>
×
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
, as shown in Algorithm 2.</p>
<p>Hamming distance is called
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
times. Therefore, excluding wildcards placement, Algorithm 1 takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
<italic>l</italic>
)time.</p>
<p>Wildcards placement procedure is called (
<italic>m</italic>
<italic>l</italic>
+1)times and each time |
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
|=
<italic>m</italic>
<italic>l</italic>
+1 in the worst case. Therefore, in this case, wildcards placement (line 4–16)in Algorithm 2 is run
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
)times. The number of wildcards is no more than
<italic>d</italic>
. So line 4–16 takes
<inline-formula>
<mml:math id="M14" name="1471-2105-14-161-i15" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
</mml:math>
</inline-formula>
time in the worst case. As a result, wildcards placement in MSS1 takes
<inline-formula>
<mml:math id="M15" name="1471-2105-14-161-i16" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mo>(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
time. In the best case, wildcards placement procedure is only called once when all other
<italic>l</italic>
-mers in
<italic>s</italic>
<sub>1</sub>
have no 2
<italic>d</italic>
-neigbhors, and
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
=1. The best case for line 4–16 is when
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
=2
<italic>d</italic>
and it takes
<inline-formula>
<mml:math id="M16" name="1471-2105-14-161-i17" overflow="scroll">
<mml:mi>O</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
</mml:math>
</inline-formula>
time (see DISCUSSION for more analysis).</p>
<p>In summary, MSS1 takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
<italic>l</italic>
+|
<italic>s</italic>
<italic>t</italic>
<italic>e</italic>
<italic>m</italic>
<italic>s</italic>
|)time, where |
<italic>s</italic>
<italic>t</italic>
<italic>e</italic>
<italic>m</italic>
<italic>s</italic>
| is the total number of stems generated.</p>
</sec>
<sec>
<title>Algorithm 2
<bold>PlaceWildcards</bold>
</title>
</sec>
</sec>
<sec>
<title>MSS2 - A speedup Algorithm</title>
<p>The computation of the 2
<italic>d</italic>
-neighbors of
<italic>x</italic>
from
<italic>s</italic>
<sub>1</sub>
in a certain sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
can be thought of as the calculation of a distance matrix between all (
<italic>m</italic>
<italic>l</italic>
+1)
<italic>l</italic>
-mers in
<italic>s</italic>
<sub>1</sub>
against those in
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
as shown in Figure 
<xref ref-type="fig" rid="F1">1</xref>
B. A straight forward algorithm takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>l</italic>
)time. When
<italic>i</italic>
ranges from 2 to
<italic>n</italic>
, the total time will be
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
<italic>l</italic>
). In this section we show how to reduce this total time from
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
<italic>l</italic>
)to
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Illustration of speeding up the 2</bold>
<bold>
<italic>d</italic>
</bold>
<bold>-neighbors computation. A</bold>
:
<italic>l</italic>
-mer alignments.
<bold>B</bold>
: computation order in the matrix.</p>
</caption>
<graphic xlink:href="1471-2105-14-161-1"></graphic>
</fig>
<p>Assume that we have computed the Hamming distance between
<italic>x</italic>
<sub>1</sub>
in
<italic>s</italic>
<sub>1</sub>
and
<inline-formula>
<mml:math id="M17" name="1471-2105-14-161-i19" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>
in
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. Let this be
<inline-formula>
<mml:math id="M18" name="1471-2105-14-161-i20" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
. Then,
<inline-formula>
<mml:math id="M19" name="1471-2105-14-161-i21" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
can be obtained by comparing: 1)the first characters of
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
; and 2)the last characters of
<italic>x</italic>
<sub>2</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
. Observe that the (
<italic>l</italic>
−1)-length prefix of
<italic>x</italic>
<sub>2</sub>
is the (
<italic>l</italic>
−1)-length suffix of
<italic>x</italic>
<sub>1</sub>
, and
<inline-formula>
<mml:math id="M20" name="1471-2105-14-161-i22" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="M21" name="1471-2105-14-161-i23" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>
also share the same (
<italic>l</italic>
−1)-mer.</p>
<p>If the first characters of
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
match, then the
<italic>d</italic>
<sub>1</sub>
mismatches happen in the remaining (
<italic>l</italic>
−1)-long suffixes of
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
. In this case,
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
)=
<italic>d</italic>
<sub>1</sub>
if the last characters of
<italic>x</italic>
<sub>2</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
match; otherwise
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
)=
<italic>d</italic>
<sub>1</sub>
+1. Similarly, if the first characters of
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
do not match, then there are (
<italic>d</italic>
<sub>1</sub>
−1)mismatches in the remaining (
<italic>l</italic>
−1)-long suffixes of
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
</sub>
. In this case,
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
)=
<italic>d</italic>
<sub>1</sub>
−1 if the last characters of
<italic>x</italic>
<sub>2</sub>
and
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
match; otherwise
<italic>H</italic>
<italic>D</italic>
(
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>
<italic>j</italic>
+1</sub>
)=
<italic>d</italic>
<sub>1</sub>
. This observation is also mentioned in [
<xref ref-type="bibr" rid="B4">4</xref>
].</p>
<sec>
<title>Observation 4</title>
<p>Given
<inline-formula>
<mml:math id="M22" name="1471-2105-14-161-i24" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
where
<italic>x</italic>
<sub>1</sub>
and
<inline-formula>
<mml:math id="M23" name="1471-2105-14-161-i25" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>
are two
<italic>l</italic>
-mers in
<italic>s</italic>
<sub>1</sub>
and
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
, the Hamming distance between the next two
<italic>l</italic>
-mers of
<italic>s</italic>
<sub>1</sub>
and
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
,
<inline-formula>
<mml:math id="M24" name="1471-2105-14-161-i26" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
</mml:math>
</inline-formula>
can be calculated in
<italic>O</italic>
(1)time as in (1):</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M25" name="1471-2105-14-161-i27" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced open="{">
<mml:mrow>
<mml:mtable class="array" columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mtext>if</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mtext>if</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mtext>if</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mtext>if</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>[</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo></mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
</disp-formula>
</p>
<p>However,
<inline-formula>
<mml:math id="M26" name="1471-2105-14-161-i28" overflow="scroll">
<mml:mtext mathvariant="italic">HD</mml:mtext>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
where
<italic>p</italic>
>
<italic>q</italic>
is left out when OBSERVATION 4 is used repeatedly and reaches the end of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. We simply append a copy of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
to
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
to cover all the pairwise alignments (Figure 
<xref ref-type="fig" rid="F1">1</xref>
A). Then, by calculating the Hamming distance only once and applying OBSERVATION 4 repeatedly, each diagonal in the matrix of Figure 
<xref ref-type="fig" rid="F1">1</xref>
B can be computed in
<italic>O</italic>
(
<italic>l</italic>
+
<italic>m</italic>
)time.</p>
<p>We do the above for
<italic>m</italic>
diagonals from cell
<inline-formula>
<mml:math id="M27" name="1471-2105-14-161-i29" overflow="scroll">
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
to
<inline-formula>
<mml:math id="M28" name="1471-2105-14-161-i30" overflow="scroll">
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>)</mml:mo>
</mml:math>
</inline-formula>
in Figure 
<xref ref-type="fig" rid="F1">1</xref>
B. Then the first and last (
<italic>m</italic>
<italic>l</italic>
+1)rows are used to form a complete (
<italic>m</italic>
<italic>l</italic>
+1)×(
<italic>m</italic>
<italic>l</italic>
+1)matrix. The
<italic>l</italic>
rows in the middle were eliminated since they are the extra rows caused by appending a copy of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. Therefore, the computation of the 2
<italic>d</italic>
-neighbors of
<italic>x</italic>
from
<italic>s</italic>
<sub>1</sub>
in any sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
can be computed in
<italic>O</italic>
(
<italic>m</italic>
∗(
<italic>m</italic>
+
<italic>l</italic>
))=
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
)time. The computation for all the sequences from
<italic>s</italic>
<sub>2</sub>
to
<italic>s</italic>
<sub>
<italic>n</italic>
</sub>
takes a total of
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
)time.</p>
<p>The pseudocode is given in Algorithm 3. In lines 6–10, the Hamming distance is calculated for the alignment of
<italic>s</italic>
<sub>1</sub>
with the
<italic>j</italic>
<sub>
<italic>t</italic>
<italic>h</italic>
</sub>
position of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. Each of the remaining Hamming distances in this alignment is obtained in constant time by OBSERVATION 4 (line 12–26). Instead of appending a copy of
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
, the
<italic>mod</italic>
operation is used.
<bold>MSS2</bold>
</p>
<p>
<italic>N</italic>
<sub>2
<italic>d</italic>
</sub>
[
<italic>k</italic>
][
<italic>i</italic>
] keeps the 2
<italic>d</italic>
-neighbors of the
<italic>k</italic>
<sub>
<italic>t</italic>
<italic>h</italic>
</sub>
<italic>l</italic>
-mer in
<italic>s</italic>
<sub>1</sub>
in the
<italic>i</italic>
<sub>
<italic>t</italic>
<italic>h</italic>
</sub>
sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
. To build the matrix of 2
<italic>d</italic>
-neighbors of all the
<italic>l</italic>
-mers of
<italic>s</italic>
<sub>1</sub>
(
<italic>N</italic>
<sub>2
<italic>d</italic>
</sub>
[
<italic>k</italic>
][
<italic>i</italic>
]), it takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
)time (lines 3–28). Lines 29–41 search the 2
<italic>d</italic>
-neighbors of each
<italic>l</italic>
-mer of
<italic>s</italic>
<sub>1</sub>
. If some sequence
<italic>s</italic>
<sub>
<italic>j</italic>
</sub>
has no 2
<italic>d</italic>
-neighbors, the current
<italic>i</italic>
<sub>
<italic>t</italic>
<italic>h</italic>
</sub>
<italic>l</italic>
-mer of
<italic>s</italic>
<sub>1</sub>
is skipped (lines 32–34). Otherwise, the 2
<italic>d</italic>
-neighbors in the sequence with the smallest size,
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
, are used and the wildcards are placed.</p>
<p>MSS2 takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
+|
<italic>s</italic>
<italic>t</italic>
<italic>e</italic>
<italic>m</italic>
<italic>s</italic>
|)time.</p>
<p>Optionally, a post-process phase is used following both MSS1 and MSS2 algorithms to refine the output stems. In the post-process phase, a stem is retained only if this stem has at least one neighbor in each sequence at a distance of ≤
<italic>d</italic>
. This phase takes a total of
<italic>O</italic>
(
<italic>m</italic>
<italic>n</italic>
<italic>l</italic>
∗|
<italic>s</italic>
<italic>t</italic>
<italic>e</italic>
<italic>m</italic>
|)time.</p>
</sec>
<sec>
<title>An estimation on the number of stems</title>
<p>We can compute the expected number of stems generated by our algorithms as follows. Let
<italic>q</italic>
be any
<italic>l</italic>
-mer in
<italic>s</italic>
<sub>1</sub>
. What can we say about |
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
| corresponding to
<italic>q</italic>
? Consider any sequence
<italic>s</italic>
other than
<italic>s</italic>
<sub>1</sub>
. Let
<italic>Q</italic>
be any
<italic>l</italic>
-mer of
<italic>s</italic>
. The probability
<italic>p</italic>
that HD (
<italic>q</italic>
,
<italic>Q</italic>
)≤2
<italic>d</italic>
is
<inline-formula>
<mml:math id="M29" name="1471-2105-14-161-i31" overflow="scroll">
<mml:msubsup>
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>σ</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
where
<italic>σ</italic>
=|
<italic>Σ</italic>
|. This implies that the expected number of such
<italic>Q</italic>
’s is
<italic>mp</italic>
. When
<italic>Σ</italic>
increases,
<italic>p</italic>
decreases drastically, as examples are shown in Table
<xref ref-type="table" rid="T2">2</xref>
for
<italic>σ</italic>
=4 and
<italic>σ</italic>
=20. In all the previous works (see e.g., [
<xref ref-type="bibr" rid="B6">6</xref>
]), analyses have been done assuming that all the
<italic>l</italic>
-mers in any sequence are independent. If we assume this, then we can apply Chernoff bounds and show that the number of such
<italic>Q</italic>
’s is
<italic>O</italic>
(
<italic>m</italic>
<italic>p</italic>
)with high probability. This in turn will imply that |
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
|=
<italic>O</italic>
(
<italic>m</italic>
<italic>p</italic>
)with high probability.
<italic>N</italic>
<sub>
<italic>stems</italic>
</sub>
, the number of stems generated between any two
<italic>l</italic>
-mers with the hamming distance
<italic>d</italic>
<sub>
<italic>H</italic>
<italic>M</italic>
</sub>
, is given in (2), which is crudely bounded by
<italic>O</italic>
(2
<sup>
<italic>l</italic>
</sup>
<italic>l</italic>
<sup>
<italic>d</italic>
</sup>
). As a result, it follows that the expected number of stems generated by our algorithms is
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>p</italic>
2
<sup>
<italic>l</italic>
</sup>
<italic>l</italic>
<sup>
<italic>d</italic>
</sup>
).</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M30" name="1471-2105-14-161-i32" overflow="scroll">
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">stems</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>=</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mfenced open="{">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">HM</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo></mml:mo>
<mml:mo>max</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">HM</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>d</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">HM</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo></mml:mo>
<mml:mo>max</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
</mml:mtd>
<mml:mtd>
<mml:mi>d</mml:mi>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">HM</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo></mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>d</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
</disp-formula>
</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>
<bold>Example values of </bold>
<bold>
<italic>p </italic>
</bold>
<bold> given | </bold>
<bold>
<italic>Σ </italic>
</bold>
<bold>|=4 and | </bold>
<bold>
<italic>Σ </italic>
</bold>
<bold>|=20</bold>
</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>(</bold>
<bold>
<italic>l,d)</italic>
</bold>
</th>
<th align="center">
<bold>| </bold>
<bold>
<italic>Σ </italic>
</bold>
<bold>|=4</bold>
</th>
<th align="center">
<bold>| </bold>
<bold>
<italic>Σ </italic>
</bold>
<bold>|=20</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">(7,1)
<hr></hr>
</td>
<td align="center" valign="bottom">1.29∗10
<sup>−2</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">6.03∗10
<sup>−6</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">4.89∗10
<sup>−2</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">3.32∗10
<sup>−5</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">1.15∗10
<sup>−1</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">1.11∗10
<sup>−4</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(13,4)
<hr></hr>
</td>
<td align="center" valign="bottom">6.38∗10
<sup>−2</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">8.88∗10
<sup>−5</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(15,5)
<hr></hr>
</td>
<td align="center" valign="bottom">4.27∗10
<sup>−4</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">8.22∗10
<sup>−7</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(17,6)
<hr></hr>
</td>
<td align="center" valign="bottom">5.82∗10
<sup>−10</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">2.76∗10
<sup>−20</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(19,7)
<hr></hr>
</td>
<td align="center" valign="bottom">3.64∗10
<sup>−12</sup>
<hr></hr>
</td>
<td align="center" valign="bottom">1.91∗10
<sup>−25</sup>
<hr></hr>
</td>
</tr>
<tr>
<td align="left">(21,8)</td>
<td align="center">1.43∗10
<sup>−3</sup>
</td>
<td align="center">1.21∗10
<sup>−5</sup>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>Algorithm 3
<bold>MSS2</bold>
</title>
</sec>
</sec>
<sec>
<title>Challenging instances</title>
<p>Consider a PMS instance with
<italic>n</italic>
sequences of length
<italic>m</italic>
each. For a given value
<italic>l</italic>
, let
<italic>d</italic>
be the smallest integer such that the expected number of (
<italic>l,d</italic>
)motifs that occur by random chance is ≥1. We refer to (
<italic>l,d</italic>
)as a challenging instance. We can compute challenging instances as follows. Let the alphabet under concern be
<italic>Σ</italic>
with |
<italic>Σ</italic>
|=
<italic>σ</italic>
. The probability that two random characters in this alphabet match is 1/|
<italic>σ</italic>
|. Then assuming an IID background, the probability that the Hamming distance between two
<italic>l</italic>
-mers is no more than
<italic>d</italic>
is
<inline-formula>
<mml:math id="M31" name="1471-2105-14-161-i34" overflow="scroll">
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathsize="big"></mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mfrac linethickness="0">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mfenced>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>σ</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>σ</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>
. For each sequence, the probability that a random
<italic>l</italic>
-mer has at least one
<italic>d</italic>
-neighbor (i.e., an
<italic>l</italic>
-mer with a Hamming distance of no more than
<italic>d</italic>
)in this sequence is
<italic>P</italic>
=1−(1−
<italic>p</italic>
)
<sup>
<italic>m</italic>
<italic>l</italic>
+1</sup>
. This means that the expected number of randomly occurring (
<italic>l,d</italic>
)motifs in the
<italic>n</italic>
sequences is
<italic>σ</italic>
<sup>
<italic>l</italic>
</sup>
<italic>P</italic>
<sup>
<italic>n</italic>
</sup>
. From this we can calculate the challenging instances. For
<italic>σ</italic>
=4, the challenging instances are (7,1),(9,2), etc. When
<italic>σ</italic>
=20, the challenging instances are (7,4),(9,5), etc. Because of Observation 1, in our tests of challenging instances of protein sequences, we have used the cases of (7,3),(9,4), and (11,5).</p>
</sec>
</sec>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<p>We have evaluated our algorithms on the standard benchmark where
<italic>n</italic>
=20,
<italic>m</italic>
=600. Let |
<italic>Σ</italic>
|=20 (for proteins). We have used (
<italic>l,d</italic>
)values starting from (7,1)going up to (21,8).</p>
<p>The testing data was generated as follows: 1)20 sequences of length 600 each were generated such that each character in each sequence is equally likely to be one of the characters from the alphabet; 2)a motif of length
<italic>l</italic>
was generated randomly; 3)a random number of mismatch positions which is smaller than or equal to
<italic>d</italic>
was selected and the characters in these positions were replaced by other amino acids randomly to form a motif instance; 4)step 3)was done 20 times to generate 20 such instances and these were planted in the 20 sequences (at random places with one instance per sequence).</p>
<p>We have implemented and compared our algorithms with RISOTTO [
<xref ref-type="bibr" rid="B14">14</xref>
] and the stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
]. Please note that we have implemented the algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] since we have no access to a running version of the corresponding program. Both the running time and the number of stems generated were used as performance measures. The machine used had an Intel Core i7-2760QM 2.40GHZ processor with a memory size of 4GB, as shown in Table
<xref ref-type="table" rid="T3">3</xref>
and Table
<xref ref-type="table" rid="T4">4</xref>
. In these tables "-" indicates that the algorithm took too long to finish. These tables show that MSS1 and MSS2 run faster than RISOTTO [
<xref ref-type="bibr" rid="B14">14</xref>
] and stemming [
<xref ref-type="bibr" rid="B21">21</xref>
]. Since the set of stems is a superset of the true motifs, the stems set contains true motifs and false motifs (or false positive predictions). A smaller number of stems indicates less false predictions. The proposed algorithms generate a much smaller subset of the stems generated by the stemming algorithm [
<xref ref-type="bibr" rid="B21">21</xref>
]. Since instances such as (7,1),(9,2),(11,3),
<italic>e</italic>
<italic>t</italic>
<italic>c</italic>
. are commonly used in DNA sequences, we have also tested the algorithms on more challenging cases such as (7,3),(9,4), and (11,5)as shown in Table
<xref ref-type="table" rid="T5">5</xref>
. In addition to the case of
<italic>σ</italic>
=20, we have also tried different alphabet sizes: 40, 60, 80, and 100. Table
<xref ref-type="table" rid="T6">6</xref>
displays the running time for various alphabet sizes. The fact that the rune times are nearly the same for different alphabet sizes indicates that the running time of all the algorithms are independent of the alphabet size. The post-processing phase takes longer time as the alphabet size increases since the number of stems increases.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Time comparison of MSS, RISOTTO, and stemming algorithms</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>(</bold>
<bold>
<italic>l,d </italic>
</bold>
<bold>)</bold>
</th>
<th align="center">
<bold>MSS1(s)</bold>
</th>
<th align="center">
<bold>MSS2(s)</bold>
</th>
<th align="center">
<bold>Post-process(s)</bold>
</th>
<th align="center">
<bold>RISOTTO(s)</bold>
</th>
<th align="center">
<bold>Stemming(s)</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">(7,1)
<hr></hr>
</td>
<td align="center" valign="bottom">23.2
<hr></hr>
</td>
<td align="center" valign="bottom">3.7
<hr></hr>
</td>
<td align="center" valign="bottom">0.03
<hr></hr>
</td>
<td align="center" valign="bottom">4.7
<hr></hr>
</td>
<td align="center" valign="bottom">48.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">24.64
<hr></hr>
</td>
<td align="center" valign="bottom">3.7
<hr></hr>
</td>
<td align="center" valign="bottom">0.3
<hr></hr>
</td>
<td align="center" valign="bottom">242.3
<hr></hr>
</td>
<td align="center" valign="bottom">359.9
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">27.5
<hr></hr>
</td>
<td align="center" valign="bottom">3.7
<hr></hr>
</td>
<td align="center" valign="bottom">2.0
<hr></hr>
</td>
<td align="center" valign="bottom">7166.1
<hr></hr>
</td>
<td align="center" valign="bottom">4674.2
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(13,4)
<hr></hr>
</td>
<td align="center" valign="bottom">34.5
<hr></hr>
</td>
<td align="center" valign="bottom">3.9
<hr></hr>
</td>
<td align="center" valign="bottom">50.4
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(15,5)
<hr></hr>
</td>
<td align="center" valign="bottom">38.8
<hr></hr>
</td>
<td align="center" valign="bottom">4.7
<hr></hr>
</td>
<td align="center" valign="bottom">74.5
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(17,6)
<hr></hr>
</td>
<td align="center" valign="bottom">60.2
<hr></hr>
</td>
<td align="center" valign="bottom">15.3
<hr></hr>
</td>
<td align="center" valign="bottom">1459.0
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(19,7)
<hr></hr>
</td>
<td align="center" valign="bottom">200.8
<hr></hr>
</td>
<td align="center" valign="bottom">117.3
<hr></hr>
</td>
<td align="center" valign="bottom">18364.1
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left">(21,8)</td>
<td align="center">719.6</td>
<td align="center">563.2</td>
<td align="center">69340.8</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption>
<p>Number of stems generated by MSS and stemming algorithms</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>(</bold>
<bold>
<italic>l,d </italic>
</bold>
<bold>)</bold>
</th>
<th align="center">
<bold>MSS1/MSS2</bold>
</th>
<th align="center">
<bold>Post-process</bold>
</th>
<th align="center">
<bold>Stemming</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">(7,1)
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">928
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">22
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">17894
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">130
<hr></hr>
</td>
<td align="center" valign="bottom">44
<hr></hr>
</td>
<td align="center" valign="bottom">265587
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(13,4)
<hr></hr>
</td>
<td align="center" valign="bottom">2250
<hr></hr>
</td>
<td align="center" valign="bottom">1452
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(15,5)
<hr></hr>
</td>
<td align="center" valign="bottom">5222
<hr></hr>
</td>
<td align="center" valign="bottom">1032
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(17,6)
<hr></hr>
</td>
<td align="center" valign="bottom">60168
<hr></hr>
</td>
<td align="center" valign="bottom">23829
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(19,7)
<hr></hr>
</td>
<td align="center" valign="bottom">521658
<hr></hr>
</td>
<td align="center" valign="bottom">422019
<hr></hr>
</td>
<td align="center" valign="bottom">-
<hr></hr>
</td>
</tr>
<tr>
<td align="left">(21,8)</td>
<td align="center">2255690</td>
<td align="center">1297576</td>
<td align="center">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption>
<p>Comparison of MSS, ROSOTTO, and stemming algorithms on challenging instances</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>(</bold>
<bold>
<italic>l,d </italic>
</bold>
<bold>)</bold>
</th>
<th align="center">
<bold>MSS1(s)</bold>
</th>
<th align="center">
<bold>MSS2(s)</bold>
</th>
<th align="center">
<bold>ROSITTO</bold>
</th>
<th align="center">
<bold>Stemming</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">(7,3)
<hr></hr>
</td>
<td align="center" valign="bottom">225.9
<hr></hr>
</td>
<td align="center" valign="bottom">615.7
<hr></hr>
</td>
<td align="center" valign="bottom">7068.6
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(9,4)
<hr></hr>
</td>
<td align="center" valign="bottom">1051.0
<hr></hr>
</td>
<td align="center" valign="bottom">1477.4
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left">(11,5)</td>
<td align="center">5129.4</td>
<td align="center">5503.0</td>
<td align="center">>4hours</td>
<td align="center">>4hours</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption>
<p>Statistics on different alphabet sizes</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>| </bold>
<bold>
<italic>Σ </italic>
</bold>
<bold>|</bold>
</th>
<th align="center">
<bold>MSS1(s)/ | </bold>
<bold>
<italic>s </italic>
</bold>
<bold>
<italic>t </italic>
</bold>
<bold>
<italic>e </italic>
</bold>
<bold>
<italic>m </italic>
</bold>
<bold>
<italic>s </italic>
</bold>
<bold>|</bold>
</th>
<th align="center">
<bold>MSS2(s)/ | </bold>
<bold>
<italic>s </italic>
</bold>
<bold>
<italic>t </italic>
</bold>
<bold>
<italic>e </italic>
</bold>
<bold>
<italic>m </italic>
</bold>
<bold>
<italic>s </italic>
</bold>
<bold>|</bold>
</th>
<th align="center">
<bold>Post-process(s)/ | </bold>
<bold>
<italic>s </italic>
</bold>
<bold>
<italic>t </italic>
</bold>
<bold>
<italic>e </italic>
</bold>
<bold>
<italic>m </italic>
</bold>
<bold>
<italic>s </italic>
</bold>
<bold>|</bold>
</th>
<th align="center">
<bold>Stemming(s)/ | </bold>
<bold>
<italic>s </italic>
</bold>
<bold>
<italic>t </italic>
</bold>
<bold>
<italic>e </italic>
</bold>
<bold>
<italic>m </italic>
</bold>
<bold>
<italic>s </italic>
</bold>
<bold>|</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">40
<hr></hr>
</td>
<td align="center" valign="bottom">25.1/190
<hr></hr>
</td>
<td align="center" valign="bottom">3.6/190
<hr></hr>
</td>
<td align="center" valign="bottom">2.4/45
<hr></hr>
</td>
<td align="center" valign="bottom">2125.5/16669665
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">60
<hr></hr>
</td>
<td align="center" valign="bottom">26.2/400
<hr></hr>
</td>
<td align="center" valign="bottom">3.6/400
<hr></hr>
</td>
<td align="center" valign="bottom">6.9/169
<hr></hr>
</td>
<td align="center" valign="bottom">3023.4/18465345
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">80
<hr></hr>
</td>
<td align="center" valign="bottom">23.6/50
<hr></hr>
</td>
<td align="center" valign="bottom">3.6/50
<hr></hr>
</td>
<td align="center" valign="bottom">0.4/4
<hr></hr>
</td>
<td align="center" valign="bottom">3493.0/11380993
<hr></hr>
</td>
</tr>
<tr>
<td align="left">100</td>
<td align="center">27.1/260</td>
<td align="center">3.6/260</td>
<td align="center">5.6/216</td>
<td align="center">4464.9/17733385</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Due to better performance, we have used MSS2 in real biological protein data. In Minimotif Miner 3.0 database [
<xref ref-type="bibr" rid="B1">1</xref>
], we randomly sampled 14 protein motifs. Each of these motifs has multiple source proteins. Comparison of MSS, RISOTTO, and stemming is shown in Table
<xref ref-type="table" rid="T7">7</xref>
.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption>
<p>Motif search on protein data</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>Protein motifs</bold>
</th>
<th align="center">
<bold>
<italic>#</italic>
</bold>
<bold>Source proteins</bold>
</th>
<th align="center">
<bold>(</bold>
<bold>
<italic>l,d </italic>
</bold>
<bold>)</bold>
</th>
<th align="center">
<bold>MSS2(s)</bold>
</th>
<th align="center">
<bold>RISOTTO(s)</bold>
</th>
<th align="center">
<bold>Stemming(s)</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">CPTINEPCC
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">2.0
<hr></hr>
</td>
<td align="center" valign="bottom">100.0
<hr></hr>
</td>
<td align="center" valign="bottom">244.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">CRFYNCHHLHEPGC
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">(14,4)
<hr></hr>
</td>
<td align="center" valign="bottom">22.2
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">HTHPTQTAFLSSVD
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">(14,4)
<hr></hr>
</td>
<td align="center" valign="bottom">10.3
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">ILPPVPVPK
<hr></hr>
</td>
<td align="center" valign="bottom">14
<hr></hr>
</td>
<td align="center" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">3.8
<hr></hr>
</td>
<td align="center" valign="bottom">105.8
<hr></hr>
</td>
<td align="center" valign="bottom">582.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">PEPNGYLHIGH
<hr></hr>
</td>
<td align="center" valign="bottom">134
<hr></hr>
</td>
<td align="center" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">51.1
<hr></hr>
</td>
<td align="center" valign="bottom">12827.0
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">PSPTGFIHLGN
<hr></hr>
</td>
<td align="center" valign="bottom">36
<hr></hr>
</td>
<td align="center" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">6.5
<hr></hr>
</td>
<td align="center" valign="bottom">4336.6
<hr></hr>
</td>
<td align="center" valign="bottom">4561.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">PTVYNYAHIGN
<hr></hr>
</td>
<td align="center" valign="bottom">19
<hr></hr>
</td>
<td align="center" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">3.6
<hr></hr>
</td>
<td align="center" valign="bottom">3358.9
<hr></hr>
</td>
<td align="center" valign="bottom">4917.0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">PYANGSIHLGH
<hr></hr>
</td>
<td align="center" valign="bottom">110
<hr></hr>
</td>
<td align="center" valign="bottom">(11,3)
<hr></hr>
</td>
<td align="center" valign="bottom">52.1
<hr></hr>
</td>
<td align="center" valign="bottom">11363.2
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">PYPSGQGLHVGH
<hr></hr>
</td>
<td align="center" valign="bottom">18
<hr></hr>
</td>
<td align="center" valign="bottom">(12,3)
<hr></hr>
</td>
<td align="center" valign="bottom">10.4
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">QELFKRISEQFTAMF
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">(15,4)
<hr></hr>
</td>
<td align="center" valign="bottom">47.6
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">QIKTLNNKFASFIDK
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">(15,4)
<hr></hr>
</td>
<td align="center" valign="bottom">20.3
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SGYSSPGSPGTPGSR
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">(15,4)
<hr></hr>
</td>
<td align="center" valign="bottom">32.6
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SSSSLEKSYELPDGQ
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">(15,4)
<hr></hr>
</td>
<td align="center" valign="bottom">41.3
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
<td align="center" valign="bottom">>4hours
<hr></hr>
</td>
</tr>
<tr>
<td align="left">VTVYDYCHLGH</td>
<td align="center">8</td>
<td align="center">(11,3)</td>
<td align="center">2.9</td>
<td align="center">3145.8</td>
<td align="center">2235.0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Finally, we have compared the MSS2 algorithm with PMSPrune, a well-known Plant Motif Search (PMS)algorithm on DNA sequences [
<xref ref-type="bibr" rid="B22">22</xref>
]. As is clear from Table
<xref ref-type="table" rid="T8">8</xref>
, MSS2 is not as fast as PMSPrune. On DNA sequences, the number of spurious motifs is very large. Therefore, the Motif Stems Search algorithms, which are efficient for large alphabets are not as efficient for small alphabets.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption>
<p>MSS2 vs. PMSPrune on DNA data</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>(</bold>
<bold>
<italic>l,d </italic>
</bold>
<bold>)</bold>
</th>
<th align="center">
<bold>MSS2(s)</bold>
</th>
<th align="center">
<bold>PMSPrune(s)</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="bottom">(7,1)
<hr></hr>
</td>
<td align="center" valign="bottom">4.1
<hr></hr>
</td>
<td align="center" valign="bottom">3.3
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">(9,2)
<hr></hr>
</td>
<td align="center" valign="bottom">10.7
<hr></hr>
</td>
<td align="center" valign="bottom">3.4
<hr></hr>
</td>
</tr>
<tr>
<td align="left">(11,3)</td>
<td align="center">87.2</td>
<td align="center">8.1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Discussion and conclusions</title>
<p>The analysis in [
<xref ref-type="bibr" rid="B21">21</xref>
] shows that, assuming IID background, the expected number of the (
<italic>l</italic>
,2
<italic>d</italic>
)-motifs depends highly on the alphabet size |
<italic>Σ</italic>
|. Therefore, when |
<italic>Σ</italic>
| is large, the expected number of 2
<italic>d</italic>
-neighbors in the
<italic>n</italic>
<italic>m</italic>
-length sequences is very small in comparison with the total number of
<italic>l</italic>
-mers (
<italic>n</italic>
(
<italic>m</italic>
<italic>l</italic>
+1)).</p>
<p>The proposed algorithms consider an even smaller size of candidates by introducing
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
. In particular, for any given
<italic>l</italic>
-mer
<italic>x</italic>
, we focus on the sequence that has the smallest number of 2
<italic>d</italic>
-neighbors for
<italic>x</italic>
. The expected size of
<italic>I</italic>
<sub>
<italic>m</italic>
<italic>i</italic>
<italic>n</italic>
</sub>
is
<inline-formula>
<mml:math id="M32" name="1471-2105-14-161-i35" overflow="scroll">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
</inline-formula>
times the total number of 2
<italic>d</italic>
-neighbors of
<italic>x</italic>
in all the sequences. Please note that we do not miss any of the valid motifs.</p>
<p>On the other hand, when generating the stems, as shown in Table
<xref ref-type="table" rid="T1">1</xref>
, once
<italic>i</italic>
wildcards in the non-matching region are placed, it is known that the upper bound of wildcards in the matching region is
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
). However, it is not necessary to enumerate all the cases from 0 to
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)in the matching region. As long as the case of (
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
))wildcards cannot be eliminated, 0 to (
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)−1)wildcards are contained in the (
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
))wildcards placement. Therefore, the proposed algorithms do not enumerate 0 to (
<italic>d</italic>
− max(
<italic>i</italic>
,
<italic>d</italic>
<sub>
<italic>x</italic>
</sub>
<italic>i</italic>
)−1)wildcards placements in the output.</p>
<p>In the computation of the 2
<italic>d</italic>
-neighbors, MSS1 takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
<italic>l</italic>
)time and
<italic>O</italic>
(
<italic>m</italic>
)space. MSS2 takes
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
<italic>n</italic>
)time and
<italic>O</italic>
(
<italic>m</italic>
<sup>2</sup>
)space. The stemming algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
] uses sorting to compute the set
<italic>I</italic>
.</p>
<p>The proposed algorithms MSS1 and MSS2 provide an efficient way to solve the Motif Stems Search problem in terms of both time and space. Also, the stems generated by MSS1 and MSS2 form a much smaller subset, with less false predictions, of the stems generated by the algorithm of [
<xref ref-type="bibr" rid="B21">21</xref>
].</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>TM contributed to the implementation of the algorithms, manuscript preparation, algorithms development, and performance analysis. SR contributed to algorithms development, analysis of the results, performance analysis, and manuscript preparation. Both authors read and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>This work has been supported in part by the following grants: NSF 0829916 and NIH R01-LM010101.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Mi</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Merlin</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Deverasetty</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gryk</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Bill</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Brooks</surname>
<given-names>AW</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Rathnayake</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Ross</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Sargeant</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Strong</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Watts</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Schiller</surname>
<given-names>MR</given-names>
</name>
<article-title>Minimotif Miner 3.0: database expansion and significantly improved reduction of false-positive predictions from consensus sequences</article-title>
<source>Nucleic Acids Res</source>
<year>2012</year>
<volume>40</volume>
<fpage>D252—D260</fpage>
<pub-id pub-id-type="pmid">22146221</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Gould</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Diella</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Via</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Puntervoll</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gemund</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Chabanis-Davidson</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Michael</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sayadi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bryne</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Chica</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Seiler</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Davey</surname>
<given-names>NE</given-names>
</name>
<name>
<surname>Haslam</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Weatheritt</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Budd</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pas</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rychlewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Trave</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Aasland</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Helmer-Citterich</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Linding</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gibson</surname>
<given-names>TJ</given-names>
</name>
<article-title>ELM: the status of the 2010 eukaryotic linear motif resource</article-title>
<source>Nucleic Acids Res</source>
<year>2010</year>
<volume>38</volume>
<fpage>167</fpage>
<lpage>180</lpage>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Obenauer</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Cantley</surname>
<given-names>LC</given-names>
</name>
<name>
<surname>Yaffe</surname>
<given-names>MB</given-names>
</name>
<article-title>Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<fpage>3635</fpage>
<lpage>3641</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg584</pub-id>
<pub-id pub-id-type="pmid">12824383</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="book">
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>hoi Sze</surname>
<given-names>S</given-names>
</name>
<article-title>Combinatorial approaches to finding subtle signals in DNA sequences</article-title>
<source>Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology</source>
<year>2000</year>
<publisher-name>Menlo Park: AAAI Press</publisher-name>
<fpage>269</fpage>
<lpage>278</lpage>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Price</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ramabhadran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Finding subtle motifs by branching from sample strings</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>149</fpage>
<lpage>155</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/19.1.149</pub-id>
<pub-id pub-id-type="pmid">12499305</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Buhler</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>Finding motifs using random projections</article-title>
<source>J Comput Biol</source>
<year>2002</year>
<volume>9</volume>
<fpage>225</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1089/10665270252935430</pub-id>
<pub-id pub-id-type="pmid">12015879</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Elkan</surname>
<given-names>C</given-names>
</name>
<article-title>Unsupervised learning of multiple motifs in biopolymers using expectation maximization</article-title>
<source>Mach Learning</source>
<year>1995</year>
<volume>21</volume>
<fpage>51</fpage>
<lpage>80</lpage>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Lawrence</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Boguski</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Neuwald</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Wootton</surname>
<given-names>JC</given-names>
</name>
<article-title>Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment</article-title>
<source>Science</source>
<year>1993</year>
<volume>262</volume>
<fpage>208</fpage>
<lpage>214</lpage>
<pub-id pub-id-type="doi">10.1126/science.8211139</pub-id>
<pub-id pub-id-type="pmid">8211139</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="book">
<name>
<surname>Rocke</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>An algorithm for finding novel gapped motifs in DNA sequences</article-title>
<source>Proceedings of the Second Annual International Conference on Computational Molecular Biology, RECOMB ’98</source>
<year>1998</year>
<publisher-name>New York: ACM</publisher-name>
<fpage>228</fpage>
<lpage>233</lpage>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Keich</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>P</given-names>
</name>
<article-title>Finding motifs in the twilight zone</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>1374</fpage>
<lpage>1381</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.10.1374</pub-id>
<pub-id pub-id-type="pmid">12376382</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="book">
<name>
<surname>Timothy</surname>
<given-names>LB</given-names>
</name>
<name>
<surname>Elkan</surname>
<given-names>C</given-names>
</name>
<article-title>Fitting a mixture model by expectation maximization to discover motifs in biopolymers</article-title>
<source>Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology</source>
<year>1994</year>
<publisher-name>Stanford, CA: AAAI Press</publisher-name>
<fpage>28</fpage>
<lpage>36</lpage>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Hertz</surname>
<given-names>GZ</given-names>
</name>
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<article-title>Identifying DNA and protein patterns with statistically significant alignments of multiple sequences</article-title>
<source>Bioinformatics</source>
<year>1999</year>
<volume>15</volume>
<fpage>563</fpage>
<lpage>577</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/15.7.563</pub-id>
<pub-id pub-id-type="pmid">10487864</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Finding composite regulatory patterns in DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>354</fpage>
<lpage>363</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.suppl_1.S354</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book">
<name>
<surname>Pisanti</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Carvalho</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Marsan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Sagot</surname>
<given-names>MF</given-names>
</name>
<person-group person-group-type="editor">Correa J, Hevia A, Kiwi M</person-group>
<article-title>RISOTTO: Fast Extraction of Motifs with Mismatches</article-title>
<source>LATIN 2006: Theoretical Informatics, Volume 3887 of Lecture Notes in Computer Science</source>
<year>2006</year>
<publisher-name>Berlin / Heidelberg: Springer</publisher-name>
<fpage>757</fpage>
<lpage>768</lpage>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>hsi Huang</surname>
<given-names>C</given-names>
</name>
<article-title>Exact algorithm for planted motif challenge problems</article-title>
<source>J Comput Biol</source>
<year>2005</year>
<volume>12</volume>
<fpage>249</fpage>
<lpage>259</lpage>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="book">
<name>
<surname>Chin</surname>
<given-names>FYL</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HCM</given-names>
</name>
<article-title>Voting algorithms for discovering long motifs</article-title>
<source>Proceedings of the Third Asia-Pacific Bioinformatics Conference (APBC2005)</source>
<year>2005</year>
<publisher-name>London: Imperial College Press</publisher-name>
<fpage>261</fpage>
<lpage>271</lpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Davila</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>Fast and practical algorithms for planted (l, d)motif search</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinformatics</source>
<year>2007</year>
<volume>4</volume>
<fpage>544</fpage>
<lpage>552</lpage>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dinh</surname>
<given-names>H</given-names>
</name>
<article-title>A speedup technique for (l, d)-motif finding algorithms</article-title>
<source>BMC Res Notes</source>
<year>2011</year>
<volume>4</volume>
<fpage>54</fpage>
<pub-id pub-id-type="doi">10.1186/1756-0500-4-54</pub-id>
<pub-id pub-id-type="pmid">21385438</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Dinh</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kundeti</surname>
<given-names>V</given-names>
</name>
<article-title>PMS5: an efficient exact algorithm for the (l, d)-motif finding problem</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>410</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-410</pub-id>
<pub-id pub-id-type="pmid">22024209</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="book">
<name>
<surname>Bandyopadhyay</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sahni</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>PMS6: A fast algorithm for motif discovery</article-title>
<source>Proc. 2012 IEEE 2nd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)</source>
<year>2012</year>
<publisher-name>New York: IEEE</publisher-name>
<fpage>1</fpage>
<lpage>6</lpage>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
<article-title>Efficient motif finding algorithms for large-alphabet inputs</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<fpage>S1</fpage>
<pub-id pub-id-type="pmid">21034426</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Davila</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>Fast and practical algorithms for planted (l, d)motif search</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinformatics</source>
<year>2007</year>
<volume>4</volume>
<issue>4</issue>
<fpage>544</fpage>
<lpage>552</lpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000946  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000946  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021