MersV1, Pmc, Corpus, bibRecord, 001075

PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search

Identifieur interne : 001075 ( Pmc/Corpus ); précédent : 001074; suivant : 001076

PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search

Auteurs : Qiang Yu ; Hongwei Huo ; Yipu Zhang ; Hongzhi Guo

Source :

PLoS ONE [ 1932-6203 ] ; 2012.

RBID : PMC:3485246

Abstract

Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (l, d) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (l, d) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned l-mers, multiple pairs of l-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3485246

DOI: 10.1371/journal.pone.0048442
PubMed: 23119020
PubMed Central: 3485246

Links to Exploration step

PMC:3485246

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">PairMotif: A New Pattern-Driven Algorithm for Planted (<italic>l</italic>
, <italic>d</italic>
) DNA Motif Search</title>
<author><name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zhang, Yipu" sort="Zhang, Yipu" uniqKey="Zhang Y" first="Yipu" last="Zhang">Yipu Zhang</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Guo, Hongzhi" sort="Guo, Hongzhi" uniqKey="Guo H" first="Hongzhi" last="Guo">Hongzhi Guo</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">23119020</idno>
<idno type="pmc">3485246</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3485246</idno>
<idno type="RBID">PMC:3485246</idno>
<idno type="doi">10.1371/journal.pone.0048442</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">001075</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001075</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">PairMotif: A New Pattern-Driven Algorithm for Planted (<italic>l</italic>
, <italic>d</italic>
) DNA Motif Search</title>
<author><name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zhang, Yipu" sort="Zhang, Yipu" uniqKey="Zhang Y" first="Yipu" last="Zhang">Yipu Zhang</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Guo, Hongzhi" sort="Guo, Hongzhi" uniqKey="Guo H" first="Hongzhi" last="Guo">Hongzhi Guo</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (<italic>l</italic>
, <italic>d</italic>
) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (<italic>l</italic>
, <italic>d</italic>
) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned <italic>l</italic>
-mers, multiple pairs of <italic>l</italic>
-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Lawrence, Ce" uniqKey="Lawrence C">CE Lawrence</name>
</author>
<author><name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author><name sortKey="Boguski, Ms" uniqKey="Boguski M">MS Boguski</name>
</author>
<author><name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
<author><name sortKey="Neuwald, Af" uniqKey="Neuwald A">AF Neuwald</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Buhler, J" uniqKey="Buhler J">J Buhler</name>
</author>
<author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fratkin, E" uniqKey="Fratkin E">E Fratkin</name>
</author>
<author><name sortKey="Naughton, Bt" uniqKey="Naughton B">BT Naughton</name>
</author>
<author><name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
<author><name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bi, Cp" uniqKey="Bi C">CP Bi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Huang, Cw" uniqKey="Huang C">CW Huang</name>
</author>
<author><name sortKey="Lee, Ws" uniqKey="Lee W">WS Lee</name>
</author>
<author><name sortKey="Hsieh, Sy" uniqKey="Hsieh S">SY Hsieh</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Evans, Pa" uniqKey="Evans P">PA Evans</name>
</author>
<author><name sortKey="Smith, A" uniqKey="Smith A">A Smith</name>
</author>
<author><name sortKey="Wareham, Ht" uniqKey="Wareham H">HT Wareham</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author><name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author><name sortKey="Sagot, Mf" uniqKey="Sagot M">MF Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
<author><name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author><name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bulyk, Ml" uniqKey="Bulyk M">ML Bulyk</name>
</author>
<author><name sortKey="Johnson, Pfl" uniqKey="Johnson P">PFL Johnson</name>
</author>
<author><name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author><name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
<author><name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pavesi, G" uniqKey="Pavesi G">G Pavesi</name>
</author>
<author><name sortKey="Mauri, G" uniqKey="Mauri G">G Mauri</name>
</author>
<author><name sortKey="Pesole, G" uniqKey="Pesole G">G Pesole</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
<author><name sortKey="Li, N" uniqKey="Li N">N Li</name>
</author>
<author><name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author><name sortKey="Church, Gm" uniqKey="Church G">GM Church</name>
</author>
<author><name sortKey="Moor, Bd" uniqKey="Moor B">BD Moor</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group><journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">23119020</article-id>
<article-id pub-id-type="pmc">3485246</article-id>
<article-id pub-id-type="publisher-id">PONE-D-12-19951</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0048442</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2"><subject>Biology</subject>
<subj-group><subject>Biochemistry</subject>
<subj-group><subject>Proteins</subject>
<subj-group><subject>DNA-binding proteins</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group><subject>Biotechnology</subject>
</subj-group>
<subj-group><subject>Computational Biology</subject>
<subj-group><subject>Molecular Genetics</subject>
<subj-group><subject>Gene Regulation</subject>
</subj-group>
</subj-group>
<subj-group><subject>Sequence Analysis</subject>
<subject>Text Mining</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2"><subject>Computer Science</subject>
<subj-group><subject>Algorithms</subject>
</subj-group>
<subj-group><subject>Computer Modeling</subject>
</subj-group>
<subj-group><subject>Computing Methods</subject>
</subj-group>
<subj-group><subject>Text Mining</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>PairMotif: A New Pattern-Driven Algorithm for Planted (<italic>l</italic>
, <italic>d</italic>
) DNA Motif Search</article-title>
<alt-title alt-title-type="running-head">A New Pattern-Driven Algorithm for Motif Search</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Yu</surname>
<given-names>Qiang</given-names>
</name>
<xref ref-type="aff" rid="aff1"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Huo</surname>
<given-names>Hongwei</given-names>
</name>
<xref ref-type="aff" rid="aff1"></xref>
<xref ref-type="corresp" rid="cor1"><sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Zhang</surname>
<given-names>Yipu</given-names>
</name>
<xref ref-type="aff" rid="aff1"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Guo</surname>
<given-names>Hongzhi</given-names>
</name>
<xref ref-type="aff" rid="aff1"></xref>
</contrib>
</contrib-group>
<aff id="aff1"><addr-line>School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China</addr-line>
</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Gómez</surname>
<given-names>Sergio</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1"><addr-line>Universitat Rovira i Virgili, Spain</addr-line>
</aff>
<author-notes><corresp id="cor1">* E-mail: <email>hongweihuo1234@gmail.com</email>
</corresp>
<fn fn-type="COI-statement"><p><bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con"><p>Conceived and designed the experiments: QY HWH. Performed the experiments: QY. Analyzed the data: QY HWH YPZ HZG. Wrote the paper: QY HWH YPZ HZG.</p>
</fn>
</author-notes>
<pub-date pub-type="collection"><year>2012</year>
</pub-date>
<pub-date pub-type="epub"><day>31</day>
<month>10</month>
<year>2012</year>
</pub-date>
<volume>7</volume>
<issue>10</issue>
<elocation-id>e48442</elocation-id>
<history><date date-type="received"><day>4</day>
<month>7</month>
<year>2012</year>
</date>
<date date-type="accepted"><day>25</day>
<month>9</month>
<year>2012</year>
</date>
</history>
<permissions><copyright-statement>© 2012 Yu et al</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Yu et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract><p>Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (<italic>l</italic>
, <italic>d</italic>
) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (<italic>l</italic>
, <italic>d</italic>
) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned <italic>l</italic>
-mers, multiple pairs of <italic>l</italic>
-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.</p>
</abstract>
<funding-group><funding-statement>This work was supported by the National Natural Science Foundation of China, Grant No. 61173025, and Research Fund for the Doctoral Program of Higher Education of China, Grant No. 20100203110010. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts><page-count count="10"></page-count>
</counts>
</article-meta>
</front>
<body><sec id="s1"><title>Introduction</title>
<p>Motif search plays an important role in gene finding and gene regulation relationship understanding. Taking a survey of recent developments in motif recognition algorithms, Das and Dai <xref rid="pone.0048442-Das1" ref-type="bibr">[1]</xref>
 pointed out that DNA motif search would still be an opening challenge for researchers. In this paper, we focus on the planted (<italic>l</italic>
, <italic>d</italic>
) motif search <xref rid="pone.0048442-Pevzner1" ref-type="bibr">[2]</xref>
, a widely used model for motif finding:</p>
<sec id="s1a"><title>Planted (<italic>l</italic>
, <italic>d</italic>
) Motif Search (PMS)</title>
<p>Given a set of <italic>n</italic>
-length sequences <italic>S</italic>
 = {<italic>s</italic>
<sub>1</sub>
, <italic>s</italic>
<sub>2</sub>
, …, <italic>s<sub>t</sub>
</italic>
} over the alphabet {A, T, C, G} and nonnegative integers <italic>l</italic>
 and <italic>d</italic>
, satisfying 0≤ <italic>d</italic>
 < <italic>l</italic>
 < <italic>n</italic>
, the task is to find an <italic>l</italic>
-mer (i.e., an <italic>l</italic>
-length string) <italic>m</italic>
, called a motif, such that each sequence <italic>s<sub>i</sub>
</italic>
 contains an <italic>l</italic>
-mer <italic>m<sub>i</sub>
</italic>
 differing from <italic>m</italic>
 in at most <italic>d</italic>
 positions.</p>
<p>In the PMS problem, typical values of <italic>t</italic>
 and <italic>n</italic>
 are 20 and 600; then, various combinations of <italic>l</italic>
 and <italic>d</italic>
 correspond to different instances of PMS. The instances where the value of <italic>d</italic>
 is large in relation to the value of <italic>l</italic>
 are called weak instances, which are difficult to be solved. For example, the instances (15, 4) and (18, 6) are well-known weak instances <xref rid="pone.0048442-Boucher1" ref-type="bibr">[3]</xref>
.</p>
<p>Numerous recognition algorithms, either approximate or exact, have been proposed. The initialization of most approximate algorithms is selecting start sites randomly to begin iterations, which makes them easily fall into a local optimum. Gibbs Sampling <xref rid="pone.0048442-Lawrence1" ref-type="bibr">[4]</xref>
 and MEME <xref rid="pone.0048442-Bailey1" ref-type="bibr">[5]</xref>
 are classic algorithms in this approach, and they usually use several groups of start sites as the initial states to avoid the local optimum. PROJECTION <xref rid="pone.0048442-Buhler1" ref-type="bibr">[6]</xref>
 partitions all <italic>l</italic>
-mers in <italic>S</italic>
 into many buckets, and selects some valid buckets that contain several occurrences of the desired motif and little else. Then EM algorithm is used to refine the valid buckets. MotifCut <xref rid="pone.0048442-Fratkin1" ref-type="bibr">[7]</xref>
 is a graph-theoretic approach in which motif finding is formulated as a search for the maximum density subgraph. MCEMDA <xref rid="pone.0048442-Bi1" ref-type="bibr">[8]</xref>
, a Monte Carlo version of the EM motif finding algorithm, starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. SBaSeTraM <xref rid="pone.0048442-Miller1" ref-type="bibr">[9]</xref>
 is a Bayesian search method obtained by combining two models, namely a foreground model and a background model, which describe the distribution of sequences under the alternative hypothesis that there is a TFBS and under the null hypothesis that there is no TFBS at the site, respectively. Vine <xref rid="pone.0048442-Huang1" ref-type="bibr">[10]</xref>
, the recent method, is a polynomial-time heuristic algorithm for motif search based on WINNOWER <xref rid="pone.0048442-Pevzner1" ref-type="bibr">[2]</xref>
. Generally, the approximate algorithms are able to produce results in a short time, but they cannot guarantee global optimality.</p>
<p>Exact recognition algorithms are guaranteed to obtain the best solution, although exponential running time is required in the worst case due to the NP-hard nature of PMS <xref rid="pone.0048442-Evans1" ref-type="bibr">[11]</xref>
. It is therefore meaningful to design the exact algorithm with a small time overhead. According to the search space of PMS, there are two types of exact recognition algorithms. One is the exact algorithms based on alignment matrix, which test all (<italic>n</italic>
 − <italic>l</italic>
 +1)<italic><sup>t</sup>
</italic>
 possible combinations of motif positions in each of sequences to find the one that yields the highest score. The other is the exact algorithms based on pattern, which verify all 4<italic><sup>l</sup>
</italic>
 possible patterns to find the one that appears in all <italic>t</italic>
 sequences with the minimum number of mutations. Although the two types of exact algorithms produce the consistent results <xref rid="pone.0048442-Jones1" ref-type="bibr">[12]</xref>
, the initial search space of the latter is much smaller than that of the former. Therefore, the research mainly concentrates on the latter in recent years, and the associated algorithms <xref rid="pone.0048442-Eskin1" ref-type="bibr">[13]</xref>
–<xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
 are called pattern-driven algorithms.</p>
<p>All pattern-driven algorithms aim to reduce candidate motifs through various means. However, no single algorithm is superior to others on all PMS instances. Being the fastest in the family of suffix tree algorithms <xref rid="pone.0048442-Eskin1" ref-type="bibr">[13]</xref>
–<xref rid="pone.0048442-Pisanti1" ref-type="bibr">[18]</xref>
, RISOTTO <xref rid="pone.0048442-Pisanti1" ref-type="bibr">[18]</xref>
 shows competitive execution time, but its performance drops significantly with the increase of the motif length. PMSP <xref rid="pone.0048442-Davila1" ref-type="bibr">[19]</xref>
 adopts the following idea: for each <italic>l-</italic>
mer <italic>x</italic>
 in <italic>s</italic>
<sub>1</sub>
, it generates <italic>d</italic>
-neighbors of <italic>x</italic>
, and then verifies if each <italic>d</italic>
-neighbor <italic>y</italic>
 is a motif by checking whether there are <italic>l-</italic>
mers in <italic>s<sub>i</sub>
</italic>
 (2≤ <italic>i</italic>
 ≤ <italic>t</italic>
) that are at distance <italic>d</italic>
 from <italic>y</italic>
. PMSprune <xref rid="pone.0048442-Davila2" ref-type="bibr">[20]</xref>
 is an improvement over PMSP obtained by using branch and bound, capable of solving many PMS instances in a short time. The iTriplet algorithm <xref rid="pone.0048442-Ho1" ref-type="bibr">[21]</xref>
, which constructs multiple triplets of <italic>l</italic>
-mers to reduce candidate motifs, is suitable for identifying long motifs (>20 nucleotides), but suffers from substantial computational requirements. PMS5 <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
, whose main idea is to use integer programming to compute the common <italic>d</italic>
-neighbors of three <italic>l-</italic>
mers, is an efficient algorithm for solving weak PMS instances with the value of <italic>l</italic>
 about 20. But PMS5 is difficult to solve weak instances with large values of <italic>l</italic>
, because of the substantial memory required for storing the results of all possible integer linear programs. There are some other methods for the exact algorithms, see for example: <xref rid="pone.0048442-Chin1" ref-type="bibr">[23]</xref>
–<xref rid="pone.0048442-Rajasekaran1" ref-type="bibr">[25]</xref>
 etc.</p>
<p>Besides the search method, objective functions also play an important role in motif search. The analysis of wild-type and mutant Zif268 zinc fingers genes made by Bulyk et al. <xref rid="pone.0048442-Bulyk1" ref-type="bibr">[26]</xref>
 indicated that there exists interdependence among positions in transcription factor binding sites for real biological data. If the objective function cannot express the interdependence, the obtained optimal solution may not possess a real biological significance. Li and Tompa <xref rid="pone.0048442-Li1" ref-type="bibr">[27]</xref>
 assessed and classified the objective functions used by existing tools, and they pointed out that an ideal objective function should assign the optimal score to the true motif. For the exact algorithms, the objective function is used to rank reported (<italic>l</italic>
, <italic>d</italic>
) motifs, and the prediction is done by selecting the motif with best score.</p>
<p>To effectively solve the PMS problem, a new pattern-driven algorithm, PairMotif, is proposed in this paper. First, motivated by the observation that the number of candidate motifs shared by two <italic>l-</italic>
mers is dramatically affected by the distance between the two <italic>l</italic>
-mers, multiple pairs of <italic>l</italic>
-mers are carefully selected from input sequences to limit the total candidate volume. Second, a new filtering rule based on a pair of <italic>l</italic>
-mers is designed to filter out more <italic>l</italic>
-mers to be scanned in candidate verification. Third, a deterministic and efficient method for traversing candidate motifs is given to perform candidate verification. The experimental results demonstrate the efficiency and effectiveness of the proposed algorithm.</p>
</sec>
</sec>
<sec sec-type="methods" id="s2"><title>Methods</title>
<sec id="s2a"><title>Ethics Statement</title>
<p>N/A. The ethics approval is not necessary for our study. That is because: 1) our study doesn’t make use of any human or vertebrate animal subjects and tissue; 2) our study focuses on faster algorithms for planted (<italic>l</italic>
, <italic>d</italic>
) motif search, which is a widely used computing model for DNA motif search, and our experiments are completed only by using computers.</p>
</sec>
<sec id="s2b"><title>Notations and Definitions</title>
<p>The Hamming distance between two <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′ is denoted by <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). In PairMotif, all pairs of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′ that satisfy <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) >2<italic>d</italic>
 are ignored, since the Hamming distance between two instances of the same motif must be less than or equal to 2<italic>d</italic>
 and 2<italic>d</italic>
 is less than <italic>l</italic>
. Given an <italic>l</italic>
-mer <italic>x</italic>
 and a sequence <italic>s</italic>
, let <italic>C</italic>
(<italic>x</italic>
, <italic>s</italic>
) be the set of all such <italic>l</italic>
-mers <italic>y</italic>
 in <italic>s</italic>
 that <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>y</italic>
) ≤2<italic>d</italic>
.</p>
<p>Given two <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, let <italic>P</italic>
<sub>1</sub>
(<italic>x</italic>
, <italic>x</italic>
′) be the set of positions at which the corresponding characters are identical, and <italic>P</italic>
<sub>0</sub>
(<italic>x</italic>
, <italic>x</italic>
′) be the set of positions at which the corresponding characters are different. Given another <italic>l</italic>
-mer <italic>y</italic>
, the <italic>l</italic>
 positions in the alignment of <italic>x</italic>
, <italic>x</italic>
′ and <italic>y</italic>
 can be divided into four categories: <italic>P</italic>
<sub>00</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
), <italic>P</italic>
<sub>01</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
), <italic>P</italic>
<sub>10</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
) and <italic>P</italic>
<sub>11</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
). More precisely, for each position <italic>i</italic>
, assume that it belongs to <italic>P<sub>ab</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
). Then, <italic>a</italic>
 is 1 if <italic>x</italic>
[<italic>i</italic>
] = <italic>x</italic>
′[<italic>i</italic>
], 0 otherwise; <italic>b</italic>
 is 1 if either <italic>y</italic>
[<italic>i</italic>
] = <italic>x</italic>
[<italic>i</italic>
] or <italic>y</italic>
[<italic>i</italic>
] = <italic>x</italic>
′[<italic>i</italic>
], 0 otherwise. <xref ref-type="fig" rid="pone-0048442-g001">Figure 1</xref>
 shows an example for partitioning the positions in the alignment of two <italic>l</italic>
-mers and that of three <italic>l</italic>
-mers.</p>
<fig id="pone-0048442-g001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.g001</object-id>
<label>Figure 1</label>
<caption><title>An example for partitioning positions in the alignment of two/three <italic>l</italic>
-mers.</title>
<p>This figure shows an example for partitioning positions in the alignment of two/three 15-mers.</p>
</caption>
<graphic xlink:href="pone.0048442.g001"></graphic>
</fig>
<sec id="s2b1"><title>Definition 1</title>
<p>Given two <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, the candidate motifs shared by <italic>x</italic>
 and <italic>x</italic>
′, <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), is defined to be {<italic>y</italic>
: |<italic>y</italic>
| = <italic>l</italic>
, <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
) ≤ <italic>d</italic>
 and <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
′) ≤ <italic>d</italic>
}.</p>
</sec>
<sec id="s2b2"><title>Definition 2</title>
<p>Given three <italic>l</italic>
-mers <italic>x</italic>
, <italic>x</italic>
′ and <italic>y</italic>
 with <italic>y</italic>
 ∈ <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), the mapping relation from <italic>x</italic>
 and <italic>x</italic>
′ to <italic>y</italic>
, <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
), is defined to be a 2-tuple <<italic>α</italic>
, <italic>β</italic>
> = <|<italic>P</italic>
<sub>10</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
)|, |<italic>P</italic>
<sub>00</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
)|>. Furthermore, the mapping relation from <italic>x</italic>
 and <italic>x</italic>
′ to <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′), is defined to be <inline-formula><inline-graphic xlink:href="pone.0048442.e001.jpg"></inline-graphic>
</inline-formula>
.</p>
</sec>
</sec>
<sec id="s2c"><title>PairMotif Algorithm</title>
<p>As shown in <xref ref-type="fig" rid="pone-0048442-g002">Figure 2</xref>
, PairMotif consists of the following three stages:</p>
<fig id="pone-0048442-g002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.g002</object-id>
<label>Figure 2</label>
<caption><title>Illustration of the PairMotif algorithm.</title>
<p>This figure takes the instance (15, 4) as an example to explain the process of PairMotif, which consists of three stages: selecting pairs, filtering <italic>l</italic>
-mers and verifying candidate motifs.</p>
</caption>
<graphic xlink:href="pone.0048442.g002"></graphic>
</fig>
<list list-type="order"><list-item><p>Selecting pairs. For each <italic>l</italic>
-mer <italic>x</italic>
 in <italic>s</italic>
<sub>1</sub>
, select a reference sequence <italic>s<sub>r</sub>
</italic>
 from <italic>S</italic>
 − {<italic>s</italic>
<sub>1</sub>
}. Then, multiple pairs of <italic>l</italic>
-mers are formed by pairing the <italic>l</italic>
-mer <italic>x</italic>
 and each <italic>l</italic>
-mer <italic>x</italic>
′ in <italic>C</italic>
(<italic>x</italic>
, <italic>s<sub>r</sub>
</italic>
). Note that, <italic>s<sub>r</sub>
</italic>
 is selected with the most restrictive limit on the total number of candidate motifs in comparison with other sequences.</p>
</list-item>
<list-item><p>Filtering <italic>l</italic>
-mers. For each selected pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, use two filtering rules to filter out <italic>l</italic>
-mers in <italic>C</italic>
(<italic>x</italic>
, <italic>s<sub>i</sub>
</italic>
) (2≤ <italic>i</italic>
 ≤ <italic>t</italic>
, <italic>i</italic>
 ≠ <italic>r</italic>
) that cannot be instances of any candidate motif in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). Let <italic>C</italic>
′(<italic>x</italic>
, <italic>s<sub>i</sub>
</italic>
) denote the set of the remaining <italic>l</italic>
-mers in <italic>C</italic>
(<italic>x</italic>
, <italic>s<sub>i</sub>
</italic>
) after filtering.</p>
</list-item>
<list-item><p>Verifying candidate motifs. For each selected pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, traverse each candidate motif <italic>y</italic>
 in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), and verify if <italic>y</italic>
 is a motif by checking whether there is an instance of <italic>y</italic>
 in each <italic>C</italic>
′(<italic>x</italic>
, <italic>s<sub>i</sub>
</italic>
).</p>
</list-item>
</list>
<p>Based on these three stages, the PairMotif algorithm is presented as follows.</p>
</sec>
<sec id="s2d"><title>Algorithm PairMotif</title>
<p><bold>Input:</bold>
<italic>l</italic>
, <italic>d</italic>
, <italic>S</italic>
 = {<italic>s</italic>
<sub>1</sub>
, <italic>s</italic>
<sub>2</sub>
, …, <italic>s<sub>t</sub>
</italic>
}</p>
<p><bold>Output:</bold>
 the (<italic>l</italic>
, <italic>d</italic>
) motif set <italic>M</italic>
</p>
<p>1: <italic>M</italic>
 ← <italic>Ф</italic>
</p>
<p>2: <bold>for</bold>
 each <italic>l</italic>
-mer <italic>x</italic>
 in <italic>s</italic>
<sub>1</sub>
<bold>do</bold>
</p>
<p>3: Compute <italic>C</italic>
(<italic>x, s<sub>i</sub>
</italic>
) (2≤ <italic>i</italic>
 ≤ <italic>t</italic>
)</p>
<p>4: Select a reference sequence <italic>s<sub>r</sub>
</italic>
 from <italic>S</italic>
 − {<italic>s</italic>
<sub>1</sub>
}</p>
<p>5: <bold>for</bold>
 each <italic>x</italic>
′ ∈ <italic>C</italic>
(<italic>x, s<sub>r</sub>
</italic>
) <bold>do</bold>
</p>
<p>6: Form a pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′</p>
<p>7: Filter <italic>l</italic>
-mers in <italic>C</italic>
(<italic>x, s<sub>i</sub>
</italic>
) (2≤ <italic>i</italic>
 ≤ <italic>t</italic>
, <italic>i</italic>
 ≠ <italic>r</italic>
) and store the remaining <italic>l</italic>
-mers in <italic>C</italic>
′(<italic>x, s<sub>i</sub>
</italic>
)</p>
<p>8: <bold>for</bold>
 each <italic>y</italic>
 ∈ <italic>M<sub>d</sub>
</italic>
(<italic>x, x</italic>
′) <bold>do</bold>
</p>
<p>9: <bold>if</bold>
 for every <italic>i</italic>
 (2<italic>≤ i ≤ t</italic>
, <italic>i</italic>
 ≠ <italic>r</italic>
), there is a <italic>y<sub>i</sub>
</italic>
 ∈ <italic>C</italic>
′(<italic>x, s<sub>i</sub>
</italic>
) such that <italic>d<sub>H</sub>
</italic>
(<italic>y, y<sub>i</sub>
</italic>
) ≤ <italic>d</italic>
<bold>then</bold>
</p>
<p>10: Add <italic>y</italic>
 to <italic>M</italic>
</p>
<p>11: Output <italic>M</italic>
</p>
<p>In line 1, the set of (<italic>l</italic>
, <italic>d</italic>
) motifs <italic>M</italic>
 is initialized to an empty set. Lines 2–6 correspond to the stage of selecting pairs, in which <italic>C</italic>
(<italic>x, s<sub>i</sub>
</italic>
) is obtained by calculating the Hamming distance from the <italic>l</italic>
-mer <italic>x</italic>
 to all <italic>l-</italic>
mers in <italic>s<sub>i</sub>
</italic>
 (2≤ <italic>i</italic>
 ≤ <italic>t</italic>
). Line 7 and lines 8–10 show the stages of filtering <italic>l</italic>
-mers and verifying candidate motifs, respectively. PairMotif guarantees to discover all (<italic>l</italic>
, <italic>d</italic>
) motifs and outputs them in line 11.</p>
<p>Next, we explain key techniques for each stage.</p>
</sec>
<sec id="s2e"><title>Stage 1: Selecting Pairs</title>
<p>For each <italic>l-</italic>
mer <italic>x</italic>
 in <italic>s</italic>
<sub>1</sub>
, a reference sequence <italic>s<sub>r</sub>
</italic>
 is required to form multiple pairs of <italic>l</italic>
-mers. Specifically, <italic>s<sub>r</sub>
</italic>
 is selected from <italic>S</italic>
 - {<italic>s</italic>
<sub>1</sub>
} by satisfying.<disp-formula id="pone.0048442.e002"><graphic xlink:href="pone.0048442.e002.jpg" position="anchor" orientation="portrait"></graphic>
<label>(1)</label>
</disp-formula>
where <inline-formula><inline-graphic xlink:href="pone.0048442.e003.jpg"></inline-graphic>
</inline-formula>
 denotes the number of candidate motifs determined by <italic>x</italic>
 and <italic>s<sub>i</sub>
</italic>
. After describing details of Stage 3, we will give the formula for calculating <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)|, which depends only on <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) and parameters <italic>l</italic>
 and <italic>d</italic>
. In experiments, the values of <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)| under different Hamming distances are cached to speed up pairs selection.</p>
<p>This selection method is valuable for limiting the total number of candidate motifs.</p>
<sec id="s2e1"><title>Observation 1</title>
<p><italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)| grows dramatically with the decrease of <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′).</p>
<p><xref ref-type="table" rid="pone-0048442-t001">Table 1</xref>
 takes the instance (15, 4) as an example to show the values of <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)| under different <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). For any four 15-mers <italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>1</sub>
′, <italic>x</italic>
<sub>2</sub>
 and <italic>x</italic>
<sub>2</sub>
′, even though the difference between <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>1</sub>
′) and <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
<sub>2</sub>
, <italic>x</italic>
<sub>2</sub>
′) is not large, <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>1</sub>
′)| and <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
<sub>2</sub>
, <italic>x</italic>
<sub>2</sub>
′)| can differ by several times or more. For example, when <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>1</sub>
′) = 8 and <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
<sub>2</sub>
, <italic>x</italic>
<sub>2</sub>
′) = 4, their difference is 4, whereas <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
<sub>2</sub>
, <italic>x</italic>
<sub>2</sub>
′)| is about 94 times <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
<sub>1</sub>
, <italic>x</italic>
<sub>1</sub>
′)|.</p>
<table-wrap id="pone-0048442-t001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.t001</object-id>
<label>Table 1</label>
<caption><title><italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) and |<italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)| for the instance (15, 4).</title>
</caption>
<alternatives><graphic id="pone-0048442-t001-1" xlink:href="pone.0048442.t001"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">d<sub>H</sub>
(x, x′)</td>
<td align="left" rowspan="1" colspan="1">R(x, x′)</td>
<td align="left" rowspan="1" colspan="1">|M<sub>d</sub>
(x, x′)|</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">9–15</td>
<td align="left" rowspan="1" colspan="1">Ф</td>
<td align="left" rowspan="1" colspan="1">0</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">8</td>
<td align="left" rowspan="1" colspan="1">{<0,0>}</td>
<td align="left" rowspan="1" colspan="1">70</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">7</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>}</td>
<td align="left" rowspan="1" colspan="1">350</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <0,2>, <1,0>}</td>
<td align="left" rowspan="1" colspan="1">1190</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">5</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <0,2>, <0,3>, <1,0>, <1,1>}</td>
<td align="left" rowspan="1" colspan="1">2970</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">4</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <0,2>, <0,3>, <0,4>, <1,0>, <1,1>, <1,2>, <2,0>}</td>
<td align="left" rowspan="1" colspan="1">6600</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">3</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <0,2>, <0,3>, <1,0>, <1,1>, <1,2>, <1,3>, <2,0>, <2,1>}</td>
<td align="left" rowspan="1" colspan="1">13504</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">2</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>, <2,2>, <3,0>}</td>
<td align="left" rowspan="1" colspan="1">27316</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">1</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <0,1>, <1,0>, <1,1>, <2,0>, <2,1>, <3,0>, <3,1>}</td>
<td align="left" rowspan="1" colspan="1">42760</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">0</td>
<td align="left" rowspan="1" colspan="1">{<0,0>, <1,0>, <2,0>, <3,0>, <4,0>}</td>
<td align="left" rowspan="1" colspan="1">100636</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt101"><label></label>
<p>This table shows the values of <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) and |<italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)| for the instance (15, 4) under different Hamming distances.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Based on this observation, given an <italic>l</italic>
-mer <italic>x</italic>
 in <italic>s</italic>
<sub>1</sub>
, we can analyze how the number of candidate motifs changes with different <italic>s<sub>i</sub>
</italic>
 in <italic>S</italic>
 - {<italic>s</italic>
<sub>1</sub>
}. On the one hand, there are a relatively small number of candidate motifs, if all <italic>l</italic>
-mers in <italic>s<sub>i</sub>
</italic>
 are at a relatively large Hamming distance from <italic>x</italic>
. On the other hand, <italic>x</italic>
 and <italic>s<sub>i</sub>
</italic>
 will lead to a huge number of candidate motifs, once there are several <italic>l</italic>
-mers in <italic>s<sub>i</sub>
</italic>
 at a very small Hamming distance from <italic>x</italic>
. Our selection method limits the total candidate volume by preventing the occurrence of the latter case.</p>
</sec>
</sec>
<sec id="s2f"><title>Stage 2: Filtering l-mers</title>
<p>In PairMotif, for each selected pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, two filtering rules below are used to determine if each <italic>l</italic>
-mer <italic>z</italic>
 in <italic>S</italic>
 - {<italic>s</italic>
<sub>1</sub>
, <italic>s<sub>r</sub>
</italic>
} is a possible instance of a certain candidate motif in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′):</p>
<sec id="s2f1"><title>Rule 1</title>
<p>If either <italic>d<sub>H</sub>
</italic>
(<italic>z</italic>
, <italic>x</italic>
) >2<italic>d</italic>
 or <italic>d<sub>H</sub>
</italic>
(<italic>z</italic>
, <italic>x</italic>
′) >2<italic>d</italic>
, then <italic>z</italic>
 is not an instance of any candidate motif in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′).</p>
</sec>
<sec id="s2f2"><title>Rule 2</title>
<p>If there is no such a 2-tuple <<italic>α</italic>
, <italic>β</italic>
> ∈ <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) that satisfies <italic>abs</italic>
(|<italic>P</italic>
<sub>10</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−<italic>α</italic>
) + <italic>abs</italic>
(|<italic>P</italic>
<sub>00</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−<italic>β</italic>
) ≤ <italic>d</italic>
, then <italic>z</italic>
 is not an instance of any candidate motif in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). The symbol <italic>abs</italic>
(·) denotes the operation of taking the absolute value.</p>
<p>Rule 1 employs the widely used criterion that two instances of the same motif must not differ by more than 2<italic>d</italic>
 differences. Rule 2 realizes filtration from a different perspective (the proof of its correctness is included in the <xref ref-type="supplementary-material" rid="pone.0048442.s001">Text S1</xref>
): it compares the distance with <italic>d</italic>
 in the case that there is an error value <<italic>α</italic>
, <italic>β</italic>
>. Therefore, Rule 2 can filter out some <italic>l</italic>
-mers that cannot be filtered out by Rule 1. Let us take the instance (15, 4) as an example. Assume that there are three <italic>l</italic>
-mers <italic>x</italic>
 (<named-content content-type="gene">AAAAAAAGGGGGGGG</named-content>
), <italic>x</italic>
′ (<named-content content-type="gene">AAAAAAACCCCCCCC</named-content>
) and <italic>z</italic>
 (<named-content content-type="gene">AAAAAAATTTTTTTT</named-content>
). By Rule 2, <italic>z</italic>
 can be filtered out successfully since <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) = {<0, 0>} and <italic>abs</italic>
(|<italic>P</italic>
<sub>10</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−0) + <italic>abs</italic>
(|<italic>P</italic>
<sub>00</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−0) = 0+8 = 8> <italic>d</italic>
. But Rule 1 is invalid since <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) = <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>z</italic>
) = <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
′, <italic>z</italic>
) = 8≤2<italic>d</italic>
. However, there also exist some <italic>l</italic>
-mers that can be filtered out by Rule 1, but cannot be filtered out by Rule 2. For example, keep <italic>x</italic>
 and <italic>x</italic>
′ unchanged, and let <italic>z = </italic>
TTTTAAAGGGGGGGG. By Rule 1, <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
′, <italic>z</italic>
) = 12>2<italic>d</italic>
, so <italic>z</italic>
 is filtered out. But by Rule 2, <italic>z</italic>
 cannot be filtered out since <italic>abs</italic>
(|<italic>P</italic>
<sub>10</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−0) + <italic>abs</italic>
(|<italic>P</italic>
<sub>00</sub>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>z</italic>
)|−0) = 4+0 = 4≤ <italic>d</italic>
. Taking these considerations into account, we use these two rules to simultaneously perform filtration.</p>
<p>Moreover, the fact that most pairs of <italic>l</italic>
-mers selected in Stage 1 have relatively large Hamming distances is conducive to filtering out more <italic>l</italic>
-mers. This can be understood through Observation 1. The larger the value of <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), the smaller the value of <italic>|M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)|. Accordingly, given a random <italic>l</italic>
-mer <italic>z</italic>
, the probability that <italic>z</italic>
 is one of the <italic>d</italic>
-neighbors of <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) is relatively small when <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) is relatively large.</p>
<p>On the basis above, most <italic>l</italic>
-mers, which need to be compared with each candidate motif in Stage 3, are filtered out in advance.</p>
</sec>
</sec>
<sec id="s2g"><title>Stage 3: Verifying Motifs</title>
<p>For each selected pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, candidate motifs in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) need to be traversed to perform candidate verification. This section gives a deterministic and efficient traversing method, rather than enumerating all possible <italic>l</italic>
-mers.</p>
<p>At first, we discuss how to compute <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) given a pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′; <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) implies the approach to traversing candidate motifs in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). Assume that <italic>y</italic>
 is a candidate motif in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) with <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′, <italic>y</italic>
) = <<italic>α</italic>
, <italic>β</italic>
>. By Definition 2, we have <italic>α</italic>
 ≤ <italic>l</italic>
−<italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), <italic>β</italic>
 ≤ <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) and <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
) + <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
′) = 2<italic>α</italic>
 +2<italic>β</italic>
 + (<italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′)−<italic>β</italic>
). Furthermore, we have <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
) + <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
′) ≤2<italic>d</italic>
 because both <italic>x</italic>
 and <italic>x</italic>
′ are instances of <italic>y</italic>
. Thus, we obtain:<disp-formula id="pone.0048442.e004"><graphic xlink:href="pone.0048442.e004.jpg" position="anchor" orientation="portrait"></graphic>
<label>(2)</label>
</disp-formula>
</p>
<p>Obviously, the value of <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) is determined by <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). Given the value of <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), it is easy to compute <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) for the specified (<italic>l</italic>
, <italic>d</italic>
) instance by listing all 2-tuples <<italic>α</italic>
, <italic>β</italic>
> satisfying (2). <xref ref-type="table" rid="pone-0048442-t001">Table 1</xref>
 shows <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′) for the instance (15, 4) under different Hamming distances.</p>
<p>Based on different 2-tuples in <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′), candidate motifs in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) can be traversed in a simple way. <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) consists of several mutually disjoint subsets with each subset sharing a different 2-tuple in <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′); these subsets are visited one by one. As shown in <xref ref-type="fig" rid="pone-0048442-g003">Figure 3</xref>
, for each <<italic>α</italic>
, <italic>β</italic>
> in <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′), let <italic>y</italic>
 = <italic>x</italic>
′, and then we traverse candidate motifs in the associated subset of <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) by changing <italic>y</italic>
 with the following three steps. First, select <italic>α</italic>
 positions from <italic>P</italic>
<sub>1</sub>
(<italic>x</italic>
, <italic>x</italic>
′), and for each <italic>i</italic>
 of these <italic>α</italic>
 positions, change <italic>y</italic>
[<italic>i</italic>
] to one of the three characters different from <italic>x</italic>
[<italic>i</italic>
]; second, select <italic>β</italic>
 positions from <italic>P</italic>
<sub>0</sub>
(<italic>x</italic>
, <italic>x</italic>
′), and for each <italic>i</italic>
 of these <italic>β</italic>
 positions, change <italic>y</italic>
[<italic>i</italic>
] to one of the two characters different from <italic>x</italic>
[<italic>i</italic>
] and <italic>x</italic>
′ [<italic>i</italic>
]; third, select a part of positions from <italic>P</italic>
<sub>0</sub>
(<italic>x</italic>
, <italic>x</italic>
′) except for those selected in step 2, and change <italic>y</italic>
[<italic>i</italic>
] to <italic>x</italic>
[<italic>i</italic>
] for each <italic>i</italic>
 of these positions. Note that, in step 3, the number of selected positions should ensure <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>y</italic>
) ≤ <italic>d</italic>
 and <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
′, <italic>y</italic>
) ≤ <italic>d</italic>
.</p>
<fig id="pone-0048442-g003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.g003</object-id>
<label>Figure 3</label>
<caption><title>An example for traversing candidate motifs in <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′).</title>
<p>This figure shows an example for traversing candidate motifs shared by two <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′. After calculating <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′), for each <<italic>α</italic>
, <italic>β</italic>
> in <italic>R</italic>
(<italic>x</italic>
, <italic>x</italic>
′), let <italic>y</italic>
 = <italic>x</italic>
′, and the process of traversing is implemented by changing <italic>y</italic>
 with three steps. First, select <italic>α</italic>
 positions from the positions where <italic>x</italic>
[<italic>i</italic>
] = <italic>x</italic>
′ [<italic>i</italic>
], and for each <italic>i</italic>
 of these <italic>α</italic>
 positions, change <italic>y</italic>
[<italic>i</italic>
] to one of the three characters different from <italic>x</italic>
[<italic>i</italic>
]. Second, select <italic>β</italic>
 positions from the positions where <italic>x</italic>
[<italic>i</italic>
] ≠<italic>x</italic>
′ [<italic>i</italic>
], and for each <italic>i</italic>
 of these <italic>β</italic>
 positions, change <italic>y</italic>
[<italic>i</italic>
] to one of the two characters different from <italic>x</italic>
[<italic>i</italic>
] and <italic>x</italic>
′ [<italic>i</italic>
]. Third, select a part of positions from the positions where <italic>x</italic>
[<italic>i</italic>
] ≠<italic>x</italic>
′ [<italic>i</italic>
] except for those selected in step 2, and change <italic>y</italic>
[<italic>i</italic>
] to <italic>x</italic>
[<italic>i</italic>
] for each <italic>i</italic>
 of these positions. The bold italic characters denote the changed positions in <italic>y</italic>
.</p>
</caption>
<graphic xlink:href="pone.0048442.g003"></graphic>
</fig>
<p>According to the process of traversing candidate motifs, the size of <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) is calculated as follows:<disp-formula id="pone.0048442.e005"><graphic xlink:href="pone.0048442.e005.jpg" position="anchor" orientation="portrait"></graphic>
<label>(3)</label>
</disp-formula>
</p>
<p>Moreover, as the process of verifying candidate motifs is to frequently calculate the Hamming distance between two <italic>l</italic>
-mers, an efficient method is used in PairMotif to calculate <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′). First, convert <italic>x</italic>
 and <italic>x</italic>
′ to integers <italic>x<sub>b</sub>
</italic>
 and <italic>x</italic>
′<italic><sub>b</sub>
</italic>
 by encoding each character of <italic>x</italic>
 and <italic>x</italic>
′ to a 2-bit. Second, compute the bitwise exclusive disjunction of <italic>x<sub>b</sub>
</italic>
 and <italic>x</italic>
′<italic><sub>b</sub>
</italic>
, denoted by <italic>X</italic>
. Third, obtain <italic>d<sub>H</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′), which is the number of 2-bits that are not 00 in <italic>X</italic>
, by searching a 256 byte table ⌈<italic>l</italic>
/4⌉ times. At each position of the table, we cache the number of 2-bits that are not 00 in the associated 8-bit. In the implementation of PairMotif, all <italic>l</italic>
-mers in <italic>S</italic>
 are represented as integers and cached for skipping the first step of this method. Therefore, in practice, this method is four times faster than comparing two <italic>l</italic>
-mers directly. The details of this method and the evaluation of its speedup are included in the <xref ref-type="supplementary-material" rid="pone.0048442.s002">Text S2</xref>
.</p>
<table-wrap id="pone-0048442-t002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.t002</object-id>
<label>Table 2</label>
<caption><title>Algorithm complexities.</title>
</caption>
<alternatives><graphic id="pone-0048442-t002-2" xlink:href="pone.0048442.t002"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">Algorithm</td>
<td align="left" rowspan="1" colspan="1">Time complexity</td>
<td align="left" rowspan="1" colspan="1">Space complexity</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">PMSprune</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>tn</italic>
<sup>2</sup>
4<italic><sup>l</sup>
p<sub>d</sub>
</italic>
)</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>tn</italic>
<sup>2</sup>
)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">iTriplet</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>tn</italic>
<sup>3</sup>
<italic>l</italic>
<sup>2</sup>
<italic>d</italic>
<sup>3</sup>
<italic>p</italic>
<sub>2<italic>d</italic>
</sub>
<sup>3</sup>
)</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(4<italic><sup>l</sup>
p<sub>d</sub>
</italic>
)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PMS5</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>L</italic>
+<italic>tn</italic>
<sup>3</sup>
<italic>d</italic>
4<italic><sup>l</sup>
p<sub>d</sub>
</italic>
<sup>3</sup>
)</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>l</italic>
<sup>5</sup>
<italic>d</italic>
<sup>3</sup>
)</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PairMotif</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>tn</italic>
<sup>3</sup>
4<italic><sup>l</sup>
p<sub>d</sub>
</italic>
<sup>2</sup>
<italic>p</italic>
<sub>2<italic>d</italic>
</sub>
<sup>2</sup>
)</td>
<td align="left" rowspan="1" colspan="1"><italic>O</italic>
(<italic>tn</italic>
)</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt102"><label></label>
<p>This table shows the time and space complexities of PairMotif and that of other famous exact algorithms. Note that, <italic>t</italic>
 is the number of sequences; <italic>n</italic>
 is the sequence length; <italic>p<sub>k</sub>
</italic>
 is the probability that the Hamming distance between two random <italic>l</italic>
-mers is not more than <italic>k</italic>
; <italic>L</italic>
 represents the time to load the ILP table of PMS5, which is about 50 seconds <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s3"><title>Results and Discussion</title>
<p>We mainly compare the time performance of PairMotif with that of other famous exact algorithms, since all exact algorithms report the same results with different time overheads.</p>
<table-wrap id="pone-0048442-t003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.t003</object-id>
<label>Table 3</label>
<caption><title>Time comparison on fixed 2<italic>d</italic>
 neighborhood probability.</title>
</caption>
<alternatives><graphic id="pone-0048442-t003-3" xlink:href="pone.0048442.t003"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">Algorithm</td>
<td align="left" rowspan="1" colspan="1">(12, 3)</td>
<td align="left" rowspan="1" colspan="1">(15, 4)</td>
<td align="left" rowspan="1" colspan="1">(18, 5)</td>
<td align="left" rowspan="1" colspan="1">(21, 6)</td>
<td align="left" rowspan="1" colspan="1">(24, 7)</td>
<td align="left" rowspan="1" colspan="1">(27, 8)</td>
<td align="left" rowspan="1" colspan="1">(30, 9)</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">RISOTTO</td>
<td align="left" rowspan="1" colspan="1">25 s</td>
<td align="left" rowspan="1" colspan="1">3.8 m</td>
<td align="left" rowspan="1" colspan="1">30.3 m</td>
<td align="left" rowspan="1" colspan="1">4.1 h</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-o</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PMS5</td>
<td align="left" rowspan="1" colspan="1">17 s</td>
<td align="left" rowspan="1" colspan="1">28 s</td>
<td align="left" rowspan="1" colspan="1">2.4 m</td>
<td align="left" rowspan="1" colspan="1">2.5 m</td>
<td align="left" rowspan="1" colspan="1">2.4 m</td>
<td align="left" rowspan="1" colspan="1">-e</td>
<td align="left" rowspan="1" colspan="1">-e</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">iTriplet</td>
<td align="left" rowspan="1" colspan="1">2.9 m</td>
<td align="left" rowspan="1" colspan="1">3.1 m</td>
<td align="left" rowspan="1" colspan="1">3.8 m</td>
<td align="left" rowspan="1" colspan="1">4.2 m</td>
<td align="left" rowspan="1" colspan="1">4.9 m</td>
<td align="left" rowspan="1" colspan="1">5.9 m</td>
<td align="left" rowspan="1" colspan="1">7.4 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PMSprune</td>
<td align="left" rowspan="1" colspan="1">1 s</td>
<td align="left" rowspan="1" colspan="1">2 s</td>
<td align="left" rowspan="1" colspan="1">6 s</td>
<td align="left" rowspan="1" colspan="1">11 s</td>
<td align="left" rowspan="1" colspan="1">19 s</td>
<td align="left" rowspan="1" colspan="1">35 s</td>
<td align="left" rowspan="1" colspan="1">50 s</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">PairMotif</td>
<td align="left" rowspan="1" colspan="1">2 s</td>
<td align="left" rowspan="1" colspan="1">2 s</td>
<td align="left" rowspan="1" colspan="1">3 s</td>
<td align="left" rowspan="1" colspan="1">5 s</td>
<td align="left" rowspan="1" colspan="1">11 s</td>
<td align="left" rowspan="1" colspan="1">24 s</td>
<td align="left" rowspan="1" colspan="1">47 s</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt103"><label></label>
<p>Time units, s: seconds; m: minutes; h: hours. Note, -o: over 10 hours; -e: memory error.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<sec id="s3a"><title>Time and Space Analysis</title>
<p>Let <italic>p<sub>k</sub>
</italic>
 denote the probability that the Hamming distance between two random <italic>l</italic>
-mers is not more than <italic>k</italic>
 (0< <italic>k</italic>
 < <italic>l</italic>
):<disp-formula id="pone.0048442.e006"><graphic xlink:href="pone.0048442.e006.jpg" position="anchor" orientation="portrait"></graphic>
<label>(4)</label>
</disp-formula>
</p>
<p>The time complexity of PairMotif is analyzed by estimating the number of candidate motifs and the number of <italic>l</italic>
-mers to be scanned in verifying each candidate motif. PairMotif loops through <italic>O</italic>
(<italic>n</italic>
<sup>2</sup>
) pairs of <italic>l</italic>
-mers. For each pair of <italic>l</italic>
-mers <italic>x</italic>
 and <italic>x</italic>
′, the probability that a random <italic>l</italic>
-mer <italic>y</italic>
 becomes a candidate motif is Prob.[<italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
) ≤ <italic>d</italic>
 & <italic>d<sub>H</sub>
</italic>
(<italic>y</italic>
, <italic>x</italic>
′) ≤ <italic>d</italic>
] = <inline-formula><inline-graphic xlink:href="pone.0048442.e007.jpg"></inline-graphic>
</inline-formula>
, so the size of <italic>M<sub>d</sub>
</italic>
(<italic>x</italic>
, <italic>x</italic>
′) is approximately equal to <inline-formula><inline-graphic xlink:href="pone.0048442.e008.jpg"></inline-graphic>
</inline-formula>
. Thus, the total number of candidate motifs is<inline-formula><inline-graphic xlink:href="pone.0048442.e009.jpg"></inline-graphic>
</inline-formula>
. Furthermore, there are <italic>O</italic>
(<italic>tn</italic>
) potential <italic>l</italic>
-mers to be scanned in verifying each candidate motif. After being filtered by both Rule 1 and Rule 2, each remaining <italic>l</italic>
-mer <italic>z</italic>
 satisfies at least the condition that <italic>d<sub>H</sub>
</italic>
(<italic>z</italic>
, <italic>x</italic>
) ≤2<italic>d</italic>
 and <italic>d<sub>H</sub>
</italic>
(<italic>z</italic>
, <italic>x</italic>
′) ≤2<italic>d</italic>
. Hence, the number of <italic>l</italic>
-mers to be scanned is <inline-formula><inline-graphic xlink:href="pone.0048442.e010.jpg"></inline-graphic>
</inline-formula>
. Based on the above considerations, the time complexity of PairMotif is <inline-formula><inline-graphic xlink:href="pone.0048442.e011.jpg"></inline-graphic>
</inline-formula>
.</p>
<table-wrap id="pone-0048442-t004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.t004</object-id>
<label>Table 4</label>
<caption><title>Time comparison on different 2<italic>d</italic>
 neighborhood probability.</title>
</caption>
<alternatives><graphic id="pone-0048442-t004-4" xlink:href="pone.0048442.t004"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">(<italic>l</italic>
, <italic>d</italic>
)</td>
<td align="left" rowspan="1" colspan="1">Neighborhood Probability</td>
<td align="left" rowspan="1" colspan="1">RISOTTO</td>
<td align="left" rowspan="1" colspan="1">PMS5</td>
<td align="left" rowspan="1" colspan="1">iTriplet</td>
<td align="left" rowspan="1" colspan="1">PMSprune</td>
<td align="left" rowspan="1" colspan="1">PairMotif</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">(29, 8)</td>
<td align="left" rowspan="1" colspan="1">0.016</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-e</td>
<td align="left" rowspan="1" colspan="1">21 s</td>
<td align="left" rowspan="1" colspan="1">1 s</td>
<td align="left" rowspan="1" colspan="1">1 s</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(9, 2)</td>
<td align="left" rowspan="1" colspan="1">0.049</td>
<td align="left" rowspan="1" colspan="1">3 s</td>
<td align="left" rowspan="1" colspan="1">14 s</td>
<td align="left" rowspan="1" colspan="1">2.6 m</td>
<td align="left" rowspan="1" colspan="1">1 s</td>
<td align="left" rowspan="1" colspan="1">1 s</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(23, 7)</td>
<td align="left" rowspan="1" colspan="1">0.078</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">2.6 m</td>
<td align="left" rowspan="1" colspan="1">19.3 m</td>
<td align="left" rowspan="1" colspan="1">2.3 m</td>
<td align="left" rowspan="1" colspan="1">2.2 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(28, 9)</td>
<td align="left" rowspan="1" colspan="1">0.138</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-e</td>
<td align="left" rowspan="1" colspan="1">3.6 h</td>
<td align="left" rowspan="1" colspan="1">1.1 h</td>
<td align="left" rowspan="1" colspan="1">2.9 h</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(19, 6)</td>
<td align="left" rowspan="1" colspan="1">0.175</td>
<td align="left" rowspan="1" colspan="1">7.5 h</td>
<td align="left" rowspan="1" colspan="1">3.0 m</td>
<td align="left" rowspan="1" colspan="1">1.9 h</td>
<td align="left" rowspan="1" colspan="1">5.9 m</td>
<td align="left" rowspan="1" colspan="1">4.0 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(27, 9)</td>
<td align="left" rowspan="1" colspan="1">0.213</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-e</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">7.9 h</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(18, 6)</td>
<td align="left" rowspan="1" colspan="1">0.283</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">7.1 m</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">29.6 m</td>
<td align="left" rowspan="1" colspan="1">12.1 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(15, 5)</td>
<td align="left" rowspan="1" colspan="1">0.319</td>
<td align="left" rowspan="1" colspan="1">1.3 h</td>
<td align="left" rowspan="1" colspan="1">4.1 m</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">8.7 m</td>
<td align="left" rowspan="1" colspan="1">4.7 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(17, 6)</td>
<td align="left" rowspan="1" colspan="1">0.426</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">31.3 m</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">1.8 h</td>
<td align="left" rowspan="1" colspan="1">53.3 m</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">(19, 7)</td>
<td align="left" rowspan="1" colspan="1">0.534</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">1.4 h</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">-o</td>
<td align="left" rowspan="1" colspan="1">8.6 h</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt104"><label></label>
<p>Time units, s: seconds; m: minutes; h: hours. Note, -o: over 10 hours; -e: memory error.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>PairMotif requires little storage for implementation. In PairMotif, all <italic>l</italic>
-mers in <italic>s</italic>
<sub>1</sub>
 are traversed with each of them processed independently. After processing one <italic>l</italic>
-mer, the associated memory can be released, and we just need to consider the memory requirement for processing one <italic>l</italic>
-mer <italic>x</italic>
. Specifically, we store the (<italic>t−</italic>
1)(<italic>n-l+1</italic>
) <italic>l</italic>
-mers in <italic>s</italic>
<sub>2</sub>
, …, <italic>s<sub>t</sub>
</italic>
 and the Hamming distances from <italic>x</italic>
 to these <italic>l</italic>
-mers. Therefore, the space complexity of PairMotif is <italic>O</italic>
(<italic>tn</italic>
).</p>
<fig id="pone-0048442-g004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.g004</object-id>
<label>Figure 4</label>
<caption><title>Time comparison on different sequence lengths.</title>
<p>This figure compares PairMotif with two famous algorithms PMS5 <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
 and PMSprune <xref rid="pone.0048442-Davila2" ref-type="bibr">[20]</xref>
 on different sequence lengths on the instance (18, 6). The x-axis shows the sequence lengths. The y-axis shows the running times.</p>
</caption>
<graphic xlink:href="pone.0048442.g004"></graphic>
</fig>
<p><xref ref-type="table" rid="pone-0048442-t002">Table 2</xref>
 shows the time and space complexities of PairMotif and those of several famous exact algorithms, such as PMSprune <xref rid="pone.0048442-Davila2" ref-type="bibr">[20]</xref>
, iTriplet <xref rid="pone.0048442-Ho1" ref-type="bibr">[21]</xref>
 and PMS5 <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
. PairMotif requires the least amount of storage space. Particularly, the space complexity of PairMotif and that of PMSprune, which depend on <italic>t</italic>
 and <italic>n</italic>
, are fixed on different PMS instances; whereas the storage requirement of iTriplet and that of PMS5, which depend on <italic>l</italic>
 and <italic>d</italic>
, may be unrealistic on the PMS instances with large values of <italic>l</italic>
 and <italic>d</italic>
. For time complexity, PairMotif outperforms PMSprune because the ratio of their time complexities, <inline-formula><inline-graphic xlink:href="pone.0048442.e012.jpg"></inline-graphic>
</inline-formula>
, is less than 1. It also outperforms iTriplet on most PMS instances except for those with very large values of <italic>l</italic>
. Moreover, PairMotif shows its performance advantage over PMS5 when <italic>l</italic>
 is small (<italic>l</italic>
 <15); in this case, the time overhead of loading the ILP table becomes the limiting factor in the performance of PMS5.</p>
<table-wrap id="pone-0048442-t005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.t005</object-id>
<label>Table 5</label>
<caption><title>Experimental results on real biological data.</title>
</caption>
<alternatives><graphic id="pone-0048442-t005-5" xlink:href="pone.0048442.t005"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead><tr><td align="left" rowspan="1" colspan="1">Data set</td>
<td align="left" rowspan="1" colspan="1">(<italic>l</italic>
, <italic>d</italic>
) used</td>
<td align="left" rowspan="1" colspan="1">Amount of (<italic>l</italic>
, <italic>d</italic>
) motifs</td>
<td align="left" rowspan="1" colspan="1">Motif</td>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">preproinsulin</td>
<td align="left" rowspan="1" colspan="1">(15, 2)</td>
<td align="left" rowspan="1" colspan="1">10<sup>2</sup>
</td>
<td align="left" rowspan="1" colspan="1">CAGCCTCAGCCCCCC<xref ref-type="table-fn" rid="nt105">a</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">TG<bold><italic>CAGCCTCAGCCCC</italic>
</bold>
<xref ref-type="table-fn" rid="nt106">b</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">GAAATTG<bold><italic>CAGCCTCA</italic>
</bold>
<xref ref-type="table-fn" rid="nt107">c</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">TG<bold><italic>CAGCCTCAGCCCC</italic>
</bold>
<xref ref-type="table-fn" rid="nt108">d</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">DHFR</td>
<td align="left" rowspan="1" colspan="1">(11, 2)</td>
<td align="left" rowspan="1" colspan="1">10<sup>3</sup>
</td>
<td align="left" rowspan="1" colspan="1">ATTTCGCGCCA<xref ref-type="table-fn" rid="nt105">a</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">CATCGTCGCCG<xref ref-type="table-fn" rid="nt106">b</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>GCGCCA</italic>
</bold>
AACTT<xref ref-type="table-fn" rid="nt107">c</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TCGCGCCA</italic>
</bold>
AAC<xref ref-type="table-fn" rid="nt108">d</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">c-fos</td>
<td align="left" rowspan="1" colspan="1">(9, 2)</td>
<td align="left" rowspan="1" colspan="1">10<sup>4</sup>
</td>
<td align="left" rowspan="1" colspan="1">CCANATTNG<xref ref-type="table-fn" rid="nt105">a</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">GCCTCCCCC<xref ref-type="table-fn" rid="nt106">b</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">C<bold><italic>CTATTTGG</italic>
</bold>
A<sup>ce</sup>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">GTTGGCTGC<xref ref-type="table-fn" rid="nt108">d</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">metallothionein</td>
<td align="left" rowspan="1" colspan="1">(15, 2)</td>
<td align="left" rowspan="1" colspan="1">10<sup>1</sup>
</td>
<td align="left" rowspan="1" colspan="1">CTCTGCACRCCGCCC<xref ref-type="table-fn" rid="nt105">a</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TCTGCACCCGGCCC</italic>
</bold>
C<xref ref-type="table-fn" rid="nt106">b</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>CTCTGCACCCGGCAC</italic>
</bold>
<xref ref-type="table-fn" rid="nt107">c</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TCTGCACCCGGCCC</italic>
</bold>
C<xref ref-type="table-fn" rid="nt108">d</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Yeast ECB</td>
<td align="left" rowspan="1" colspan="1">(16, 3)</td>
<td align="left" rowspan="1" colspan="1">10<sup>1</sup>
</td>
<td align="left" rowspan="1" colspan="1">TTTCCCNNTNAGGAAA<xref ref-type="table-fn" rid="nt105">a</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TTACCCATTTAGGAAA</italic>
</bold>
<xref ref-type="table-fn" rid="nt106">b</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TTACCCATTAAGGAAA</italic>
</bold>
<xref ref-type="table-fn" rid="nt107">c</xref>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"><bold><italic>TTTCCCTTTTAGGAAA</italic>
</bold>
<xref ref-type="table-fn" rid="nt108">d</xref>
</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="nt105"><label>a</label>
<p>The published motif.</p>
</fn>
<fn id="nt106"><label>b</label>
<p>The predicted motif obtained by using consensus score.</p>
</fn>
<fn id="nt107"><label>c</label>
<p>The predicted motif obtained by using relative entropy.</p>
</fn>
<fn id="nt108"><label>d</label>
<p>The predicted motif obtained by using sequence specificity.</p>
</fn>
<fn id="nt109"><label>e</label>
<p>The corresponding (<italic>l</italic>
, <italic>d</italic>
) used is (10, 2).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s3b"><title>Test on Simulated Data</title>
<p>We adopt the simulated data sets used in <xref rid="pone.0048442-Buhler1" ref-type="bibr">[6]</xref>
. Generate a motif of length <italic>l</italic>
 and <italic>t</italic>
 identically distributed sequences of length <italic>n</italic>
. Then, for each sequence <italic>s</italic>
, implant a random motif instance, which differs from the motif in at most <italic>d</italic>
 positions, to a random position in <italic>s</italic>
.</p>
<fig id="pone-0048442-g005" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pone.0048442.g005</object-id>
<label>Figure 5</label>
<caption><title>Comparison of predicted motifs under different objective functions.</title>
<p>The x-axis shows the data sets used in our experiments. For each data set, we obtain three predicted motifs in terms of three objective functions. The y-axis shows the value of nucleotide-level correlation coefficient for each predicted motif.</p>
</caption>
<graphic xlink:href="pone.0048442.g005"></graphic>
</fig>
<p>In experiments, we fix <italic>t</italic>
 = 20 and <italic>n</italic>
 = 600, and compare the performance of PairMotif with that of several recently proposed exact algorithms, such as RISOTTO <xref rid="pone.0048442-Pisanti1" ref-type="bibr">[18]</xref>
, PMSprune <xref rid="pone.0048442-Davila2" ref-type="bibr">[20]</xref>
, iTriplet <xref rid="pone.0048442-Ho1" ref-type="bibr">[21]</xref>
 and PMS5 <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
, by varying <italic>l</italic>
 and <italic>d</italic>
 values (PMS instances). For the motif length <italic>l</italic>
, we consider its value ranging from 9 to 30, as the binding sites are short DNA segments. To select a group of PMS instances to carry out comparisons, we use the 2<italic>d</italic>
 neighborhood probability <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 calculated by (4) to indicate the weakness of a PMS instance. The larger the value of <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
, the weaker the corresponding PMS instance. All algorithms are performed in the same experimental environment with a 2.67 GHz processor and a 4 Gbyte memory. And the average running times are derived by executing different algorithms on ten simulated data sets.</p>
<p>At first, the comparisons are carried out on fixed 2<italic>d</italic>
 neighborhood probability <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
. <xref ref-type="table" rid="pone-0048442-t003">Table 3</xref>
 gives the running times of these algorithms on the PMS instances with the value of <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 around 0.05, which is approximately the same as the <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 value of the instance (15, 4). The results confirm with the time complexities in <xref ref-type="table" rid="pone-0048442-t002">Table 2</xref>
. PairMotif achieves the best execution time over other algorithms, and PMSprune is the second fastest. Although iTriplet exhibits stable running times, it does not show its performance advantage. PMS5 is defeated either because of the extra time overhead for loading ILP table or its high storage requirement <xref rid="pone.0048442-Dinh1" ref-type="bibr">[22]</xref>
. RISOTTO is very sensitive to the value of <italic>l</italic>
, and its running time increases exponentially when <italic>l</italic>
 increases.</p>
<p>Second, the running times of these algorithms are compared on a group of PMS instances with different probability <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 that ranges from 0.01 to 0.5. We do not consider the probability <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 whose value exceeds 0.5, as the corresponding motif is so degenerate that although exact algorithms can report all (<italic>l</italic>
, <italic>d</italic>
) motifs, it is difficult to distinguish the planted motif from spurious motifs. The running times are shown in <xref ref-type="table" rid="pone-0048442-t004">Table 4</xref>
. The performance of RISOTTO is the worst due to the strong sensitivity to the value of <italic>l</italic>
. The iTriplet algorithm fails to solve the PMS instances with <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 >0.2 because the value of <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
 severely affects its time complexity. PMSprune can solve most of these PMS instances in a short time except for the instances (27, 9) and (19, 7). PMS5 works well on the PMS instances with large <italic>p</italic>
<sub>2<italic>d</italic>
</sub>
, whereas fails on the instances (29, 8), (28, 9) and (27, 9) because of memory limits. For PairMotif, all these PMS instances can be solved within 10 hours. Particularly, among these algorithms, only PairMotif can solve the instance (27, 9) while other algorithms fail because either a large amount of memory or execution time is required.</p>
<p>Moreover, it should be noticed that for a specific PMS instance, the longer the input sequences, the weaker the instance. It is therefore necessary to compare the performance of algorithms on different sequence length <italic>n</italic>
 given a PMS instance. To carry out comparisons, we select two algorithms PMS5 and PMSprune besides PairMotif, and then perform them on the well-known instance (18, 6) by varying <italic>n</italic>
 from 100 to 1000. <xref ref-type="fig" rid="pone-0048442-g004">Figure 4</xref>
 plots the running times of these algorithms against the increase of <italic>n</italic>
. The running time of PairMotif is almost linearly related to the sequence length; whereas the running time of PMS5 and that of PMSprune increase sharply as the sequence length increases, especially for PMSprune. The reason why the performance of PairMotif is stable over the sequence length is that PairMotif has strong ability of filtering scanned <italic>l</italic>
-mers. That is, the remaining <italic>l</italic>
-mers to be scanned after filtering are so few that their volume is almost unchanged for different sequence lengths. Neither PMS5 nor PMSprune possesses this property. Thus, although PairMotif does not outperform PMS5 when the input sequences are short, it does so when <italic>n</italic>
 >700.</p>
</sec>
<sec id="s3c"><title>Test on Biological Data</title>
<p>To test the validity of PairMotif, we identify the known transcription factor binding sites in five real data sets discussed in <xref rid="pone.0048442-Blanchette1" ref-type="bibr">[28]</xref>
, including preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB (these data sets are included in the <xref ref-type="supplementary-material" rid="pone.0048442.s003">Dataset S1</xref>
). These data differ substantially from the simulated data.</p>
<p>PairMotif reports all (<italic>l</italic>
, <italic>d</italic>
) motifs. To obtain one predicted motif, a specific objective function is needed to rank the reported motifs, and the motif with maximum score is selected. In our experiments, three objective functions (consensus score <xref rid="pone.0048442-Jones1" ref-type="bibr">[12]</xref>
, relative entropy <xref rid="pone.0048442-Liu1" ref-type="bibr">[29]</xref>
 and sequence specificity <xref rid="pone.0048442-Pavesi1" ref-type="bibr">[30]</xref>
) suitable for exact algorithms are used to obtain three predicted motifs for each data set.</p>
<p><xref ref-type="table" rid="pone-0048442-t005">Table 5</xref>
 shows the results of our experiments. The (<italic>l</italic>
, <italic>d</italic>
) used for each data set is selected as follows: the value of <italic>l</italic>
 is fixed as the length of the published motif; the value of <italic>d</italic>
 is the minimum needed to ensure that the reported (<italic>l</italic>
, <italic>d</italic>
) motifs will contain the real motif. The third column gives the order of magnitude of the (<italic>l</italic>
, <italic>d</italic>
) motifs reported by PairMotif. The last column shows the predicted motifs selected from the reported (<italic>l</italic>
, <italic>d</italic>
) motifs by using three objective functions. The italicized part of each predicted motif indicates the part overlapped with the published motif.</p>
<p>PairMotif ensures that each predicted motif is optimal under the associated objective function. On this basis, the prediction accuracy depends on the used objective function. To evaluate each predicted motif, the nucleotide-level correlation coefficient (<italic>nCC</italic>
) <xref rid="pone.0048442-Tompa1" ref-type="bibr">[31]</xref>
 is adopted:<disp-formula id="pone.0048442.e013"><graphic xlink:href="pone.0048442.e013.jpg" position="anchor" orientation="portrait"></graphic>
<label>(5)</label>
</disp-formula>
where <italic>nTP</italic>
, <italic>nTN</italic>
, <italic>nFN</italic>
 and <italic>nFP</italic>
 are the number of true/false positive/negative predicted nucleotides. The use of correlation coefficient allows an integrated assessment of sensitivity and specificity.</p>
<p><xref ref-type="fig" rid="pone-0048442-g005">Figure 5</xref>
 compares the predicted motifs under different objective functions by showing their <italic>nCC</italic>
s. The value of <italic>nCC</italic>
 ranges from −1 (indicating perfect anticorrelation) to +1 (indicating perfect correlation); the larger the value of <italic>nCC</italic>
, the higher the accuracy of the predicted motif. For the five real data sets, the result accuracy under relative entropy and sequence specificity is obviously higher than that under consensus score. Nevertheless, no single objective function outperforms the others for every data set. For example, sequence specificity is well used for preproinsulin and DHFR, but leads to lower accuracy compared with relative entropy for c-fos and metallothionein. From the above analysis, PairMotif provides a good foundation for obtaining the real motif under a given objective function. If the objective function is able to assign the maximum score to the real motif, the real motif will be quickly found by PairMotif.</p>
</sec>
<sec id="s3d"><title>Conclusions</title>
<p>DNA motif search is a challenging problem in computer science and bioinformatics. In this paper, we propose a new combinatorial algorithm, PairMotif, which restricts the search space of motif search by generating candidate motifs from multiple pairs of <italic>l</italic>
-mers. PairMotif requires a small space complexity of <italic>O</italic>
(<italic>tn</italic>
), where <italic>t</italic>
 is the number of input sequences and <italic>n</italic>
 is the sequence length. It has a stable time performance over the sequence length, and it can solve most PMS instances in a reasonably short amount of time. Experimental results on real biological data sets show that PairMotif provides a good foundation for obtaining the real motif under a given objective function.</p>
<p>In PairMotif, all <italic>l</italic>
-mers in <italic>s</italic>
<sub>1</sub>
 are traversed and each <italic>l</italic>
-mer is processed independently. Therefore, PairMotif has a good feature of parallel computing, which allows it to use the strengths of parallel and distributed systems to improve the efficiency and quality of motif finding. Moreover, since all candidate motifs are traversed one by one in the current version of PairMotif, the performance of PairMotif can still be improved by using branch and bound in the process of traversing candidate motifs.</p>
</sec>
</sec>
<sec sec-type="supplementary-material" id="s4"><title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0048442.s001"><label>Text S1</label>
<caption><p><bold>The correctness of filtering rule 2.</bold>
</p>
<p>(DOC)</p>
</caption>
<media xlink:href="pone.0048442.s001.doc"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0048442.s002"><label>Text S2</label>
<caption><p><bold>Method for calculating Hamming distance between two </bold>
<bold><italic>l</italic>
</bold>
<bold>-mers.</bold>
</p>
<p>(DOC)</p>
</caption>
<media xlink:href="pone.0048442.s002.doc"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0048442.s003"><label>Dataset S1</label>
<caption><p><bold>The real biological data sets used in our experiments, including preproinsulin, DHFR, c-fos, metallothionein and Yeast ECB, which are discussed in </bold>
<xref rid="pone.0048442-Blanchette1" ref-type="bibr">[<bold>28</bold>
]</xref>
<bold>.</bold>
</p>
<p>(RAR)</p>
</caption>
<media xlink:href="pone.0048442.s003.rar"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pone.0048442.s004"><label>Program S1</label>
<caption><p><bold>The executable program of algorithm PairMotif.</bold>
</p>
<p>(RAR)</p>
</caption>
<media xlink:href="pone.0048442.s004.rar"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><ref-list><title>References</title>
<ref id="pone.0048442-Das1"><label>1</label>
<mixed-citation publication-type="other">Das MK, Dai HK (2007) A survey of DNA motif finding algorithms. BMC Bioinformatics 8.</mixed-citation>
</ref>
<ref id="pone.0048442-Pevzner1"><label>2</label>
<mixed-citation publication-type="other">Pevzner PA, Sze SH (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In: Altman R, Bailey TL, eds. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. California: AAAI Press. 269–278.</mixed-citation>
</ref>
<ref id="pone.0048442-Boucher1"><label>3</label>
<mixed-citation publication-type="other">Boucher C, Brown DG, Church P (2007) A graph clustering approach to weak motif recognition. In: Giancarlo R, Hannenhalli S, eds. Proceedings of the 7th International Workshop on Algorithms in Bioinformatics. Philadelphia: LNCS. 149–160.</mixed-citation>
</ref>
<ref id="pone.0048442-Lawrence1"><label>4</label>
<mixed-citation publication-type="journal"><name><surname>Lawrence</surname>
<given-names>CE</given-names>
</name>
, <name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
, <name><surname>Boguski</surname>
<given-names>MS</given-names>
</name>
, <name><surname>Liu</surname>
<given-names>JS</given-names>
</name>
, <name><surname>Neuwald</surname>
<given-names>AF</given-names>
</name>
, <etal>et al</etal>
 (<year>1993</year>
) <article-title>Detecting subtle sequence signals: a Gibb’s sampling strategy for multiple alignment</article-title>
. <source>Science</source>
<volume>262</volume>
: <fpage>208</fpage>
–<lpage>214</lpage>
.<pub-id pub-id-type="pmid">8211139</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Bailey1"><label>5</label>
<mixed-citation publication-type="other">Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Altman R, Brutlag D, eds. Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology. California: AAAI Press. 28–36.</mixed-citation>
</ref>
<ref id="pone.0048442-Buhler1"><label>6</label>
<mixed-citation publication-type="journal"><name><surname>Buhler</surname>
<given-names>J</given-names>
</name>
, <name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
 (<year>2002</year>
) <article-title>Finding motifs using random projections</article-title>
. <source>Journal of Computational Biology</source>
<volume>9</volume>
: <fpage>225</fpage>
–<lpage>242</lpage>
.<pub-id pub-id-type="pmid">12015879</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Fratkin1"><label>7</label>
<mixed-citation publication-type="journal"><name><surname>Fratkin</surname>
<given-names>E</given-names>
</name>
, <name><surname>Naughton</surname>
<given-names>BT</given-names>
</name>
, <name><surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
, <name><surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
 (<year>2006</year>
) <article-title>MotifCut: regulatory motifs finding with maximum density subgraphs</article-title>
. <source>Bioinformatics</source>
<volume>22</volume>
: <fpage>150</fpage>
–<lpage>157</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0048442-Bi1"><label>8</label>
<mixed-citation publication-type="journal"><name><surname>Bi</surname>
<given-names>CP</given-names>
</name>
 (<year>2009</year>
) <article-title>A monte carlo EM algorithm for De Novo motif discovery in biomolecular sequences. IEEE/ACM Trans</article-title>
. <source>on Computational Biology and Bioinformatics</source>
<volume>6</volume>
: <fpage>370</fpage>
–<lpage>386</lpage>
.<pub-id pub-id-type="pmid">19644166</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Miller1"><label>9</label>
<mixed-citation publication-type="other">Miller AK, Print CG, Nielsen PMF, Crampin EJ (2010) A Bayesian search for transcriptional motifs. PLoS ONE 5.</mixed-citation>
</ref>
<ref id="pone.0048442-Huang1"><label>10</label>
<mixed-citation publication-type="journal"><name><surname>Huang</surname>
<given-names>CW</given-names>
</name>
, <name><surname>Lee</surname>
<given-names>WS</given-names>
</name>
, <name><surname>Hsieh</surname>
<given-names>SY</given-names>
</name>
 (<year>2011</year>
) <article-title>An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans</article-title>
. <source>on Computational Biology and Bioinformatics</source>
<volume>8</volume>
: <fpage>959</fpage>
–<lpage>975</lpage>
.<pub-id pub-id-type="pmid">20855921</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Evans1"><label>11</label>
<mixed-citation publication-type="journal"><name><surname>Evans</surname>
<given-names>PA</given-names>
</name>
, <name><surname>Smith</surname>
<given-names>A</given-names>
</name>
, <name><surname>Wareham</surname>
<given-names>HT</given-names>
</name>
 (<year>2003</year>
) <article-title>On the complexity of finding common approximate substrings</article-title>
. <source>Theoretical Computer Science</source>
<volume>306</volume>
: <fpage>407</fpage>
–<lpage>430</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0048442-Jones1"><label>12</label>
<mixed-citation publication-type="other">Jones NC, Pevzner PA (2004) Exhaustive Search. In: An introduction to Bioinformatics Algorithms. Cambridge: MIT Press. 83–123.</mixed-citation>
</ref>
<ref id="pone.0048442-Eskin1"><label>13</label>
<mixed-citation publication-type="journal"><name><surname>Eskin</surname>
<given-names>E</given-names>
</name>
, <name><surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
 (<year>2002</year>
) <article-title>Finding composite regulatory patterns in DNA sequences</article-title>
. <source>Bioinformatics</source>
<volume>18</volume>
: <fpage>354</fpage>
–<lpage>363</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0048442-Evans2"><label>14</label>
<mixed-citation publication-type="other">Evans PA, Smith A (2003) Toward optimal motif enumeration. In: Dehne F, Sack JR, Smid M, eds. Proceedings of the Eighth International Workshop Algorithms and Data Structures. Ottawa: LNCS. 47–58.</mixed-citation>
</ref>
<ref id="pone.0048442-Sagot1"><label>15</label>
<mixed-citation publication-type="other">Sagot MF (1998) Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi CL, Moura AV, eds. Proceedings of the Third Latin American Symposium: Theoretical Informatics. Campinas: LNCS. 374–390.</mixed-citation>
</ref>
<ref id="pone.0048442-Marsan1"><label>16</label>
<mixed-citation publication-type="journal"><name><surname>Marsan</surname>
<given-names>L</given-names>
</name>
, <name><surname>Sagot</surname>
<given-names>MF</given-names>
</name>
 (<year>2000</year>
) <article-title>Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification</article-title>
. <source>Journal of Computational Biology</source>
<volume>7</volume>
: <fpage>345</fpage>
–<lpage>362</lpage>
.<pub-id pub-id-type="pmid">11108467</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Carvalho1"><label>17</label>
<mixed-citation publication-type="other">Carvalho AM, Freitas AT, Oliveira AL, Sagot MF (2005) A highly scalable algorithm for the extraction of CIS-Regulatory regions. In: Chen YP, Wong L, eds. Proceedings of the Third Asia Pacific Bioinformatics Conference. Singapore: Imperial College Press. 273–282.</mixed-citation>
</ref>
<ref id="pone.0048442-Pisanti1"><label>18</label>
<mixed-citation publication-type="other">Pisanti N, Carvalho AM, Marsan L, Sagot MF (2006) RISOTTO: Fast extraction of motifs with mismatches. In: Correa JR, Hevia A, Kiwi MA, eds. Proceedings of the Seventh Latin American Symposium: Theoretical Informatics. Valdivia: LNCS. 757–768.</mixed-citation>
</ref>
<ref id="pone.0048442-Davila1"><label>19</label>
<mixed-citation publication-type="other">Davila J, Balla S, Rajasekaran S (2006) Space and time efficient algorithms for planted motif search. In: Yi P, Zelikovsky A, eds. Proceedings of the Second International Workshop on Bioinformatics Research and Applications. UK: LNCS. 822–829.</mixed-citation>
</ref>
<ref id="pone.0048442-Davila2"><label>20</label>
<mixed-citation publication-type="journal"><name><surname>Davila</surname>
<given-names>J</given-names>
</name>
, <name><surname>Balla</surname>
<given-names>S</given-names>
</name>
, <name><surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
 (<year>2007</year>
) <article-title>Fast and practical algorithms for planted (<italic>l</italic>
, <italic>d</italic>
) motif search. IEEE/ACM Trans</article-title>
. <source>on Computational Biology and Bioinformatics</source>
<volume>4</volume>
: <fpage>544</fpage>
–<lpage>552</lpage>
.<pub-id pub-id-type="pmid">17975266</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Ho1"><label>21</label>
<mixed-citation publication-type="other">Ho ES, Jakubowski CD, Gunderson SI (2009) iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms for Molecular Biology 4.</mixed-citation>
</ref>
<ref id="pone.0048442-Dinh1"><label>22</label>
<mixed-citation publication-type="other">Dinh H, Rajasekaran S, Kundeti VK (2011) PMS5: an efficient exact algorithm for the (<italic>l</italic>
, <italic>d</italic>
)-motif finding problem. BMC Bioinformatics 12.</mixed-citation>
</ref>
<ref id="pone.0048442-Chin1"><label>23</label>
<mixed-citation publication-type="other">Chin FYL, Leung CM (2005) Voting algorithms for discovering long motifs. In: Chen YP, Wong L, eds. Proceedings of the Third Asia Pacific Bioinformatics Conference. Singapore: Imperial College Press. 261–271.</mixed-citation>
</ref>
<ref id="pone.0048442-Kuksa1"><label>24</label>
<mixed-citation publication-type="other">Kuksa PP, Pavlovic V (2010) Efficient motif finding algorithms for large-alphabet inputs. BMC Bioinformatics 11.</mixed-citation>
</ref>
<ref id="pone.0048442-Rajasekaran1"><label>25</label>
<mixed-citation publication-type="other">Rajasekaran S, Dinh H (2011) A speedup technique for (<italic>l</italic>
, <italic>d</italic>
)-motif finding algorithms. BMC Research Notes 4.</mixed-citation>
</ref>
<ref id="pone.0048442-Bulyk1"><label>26</label>
<mixed-citation publication-type="journal"><name><surname>Bulyk</surname>
<given-names>ML</given-names>
</name>
, <name><surname>Johnson</surname>
<given-names>PFL</given-names>
</name>
, <name><surname>Church</surname>
<given-names>GM</given-names>
</name>
 (<year>2002</year>
) <article-title>Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors</article-title>
. <source>Nucleic Acids Research</source>
<volume>30</volume>
: <fpage>1255</fpage>
–<lpage>1261</lpage>
.<pub-id pub-id-type="pmid">11861919</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0048442-Li1"><label>27</label>
<mixed-citation publication-type="other">Li N, Tompa M (2006) Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology 1.</mixed-citation>
</ref>
<ref id="pone.0048442-Blanchette1"><label>28</label>
<mixed-citation publication-type="other">Blanchette M (2001) Algorithms for phylogenetic footprinting. In: Lengauer T, eds. Proceedings of the Fifth Annual International Conference on Computational Molecular Biology. Montreal: ACM Press. 49–58.</mixed-citation>
</ref>
<ref id="pone.0048442-Liu1"><label>29</label>
<mixed-citation publication-type="journal"><name><surname>Liu</surname>
<given-names>X</given-names>
</name>
, <name><surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
, <name><surname>Liu</surname>
<given-names>JS</given-names>
</name>
 (<year>2001</year>
) <article-title>BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes</article-title>
. <source>Pac Symp Biocomput</source>
<volume>2001</volume>
: <fpage>127</fpage>
–<lpage>138</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0048442-Pavesi1"><label>30</label>
<mixed-citation publication-type="journal"><name><surname>Pavesi</surname>
<given-names>G</given-names>
</name>
, <name><surname>Mauri</surname>
<given-names>G</given-names>
</name>
, <name><surname>Pesole</surname>
<given-names>G</given-names>
</name>
 (<year>2001</year>
) <article-title>An algorithm for finding signals of unknown length in DNA sequences</article-title>
. <source>Bioinformatics</source>
<volume>17</volume>
: <fpage>207</fpage>
–<lpage>214</lpage>
.</mixed-citation>
</ref>
<ref id="pone.0048442-Tompa1"><label>31</label>
<mixed-citation publication-type="journal"><name><surname>Tompa</surname>
<given-names>M</given-names>
</name>
, <name><surname>Li</surname>
<given-names>N</given-names>
</name>
, <name><surname>Bailey</surname>
<given-names>TL</given-names>
</name>
, <name><surname>Church</surname>
<given-names>GM</given-names>
</name>
, <name><surname>Moor</surname>
<given-names>BD</given-names>
</name>
, <etal>et al</etal>
 (<year>2005</year>
) <article-title>Assessing computational tools for the discovery of transcription factor binding sites</article-title>
. <source>Nature Biotechnology</source>
<volume>23</volume>
: <fpage>137</fpage>
–<lpage>144</lpage>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001075 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001075 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3485246
   |texte=   PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:23119020" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	Serveur d'exploration MERS
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration MERS

PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search

PairMotif: A New Pattern-Driven Algorithm for Planted (l, d) DNA Motif Search

Source :

Abstract

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri

Pour générer des pages wiki