Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000D340 ( Pmc/Corpus ); précédent : 000D339; suivant : 000D341 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">PairMotif+: A Fast and Effective Algorithm for De Novo Motif Discovery in DNA sequences</title>
<author>
<name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
</author>
<author>
<name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
</author>
<author>
<name sortKey="Zhang, Yipu" sort="Zhang, Yipu" uniqKey="Zhang Y" first="Yipu" last="Zhang">Yipu Zhang</name>
</author>
<author>
<name sortKey="Guo, Hongzhi" sort="Guo, Hongzhi" uniqKey="Guo H" first="Hongzhi" last="Guo">Hongzhi Guo</name>
</author>
<author>
<name sortKey="Guo, Haitao" sort="Guo, Haitao" uniqKey="Guo H" first="Haitao" last="Guo">Haitao Guo</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23678291</idno>
<idno type="pmc">3654438</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3654438</idno>
<idno type="RBID">PMC:3654438</idno>
<idno type="doi">10.7150/ijbs.5786</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000D34</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D34</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">PairMotif+: A Fast and Effective Algorithm for De Novo Motif Discovery in DNA sequences</title>
<author>
<name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
</author>
<author>
<name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
</author>
<author>
<name sortKey="Zhang, Yipu" sort="Zhang, Yipu" uniqKey="Zhang Y" first="Yipu" last="Zhang">Yipu Zhang</name>
</author>
<author>
<name sortKey="Guo, Hongzhi" sort="Guo, Hongzhi" uniqKey="Guo H" first="Hongzhi" last="Guo">Hongzhi Guo</name>
</author>
<author>
<name sortKey="Guo, Haitao" sort="Guo, Haitao" uniqKey="Guo H" first="Haitao" last="Guo">Haitao Guo</name>
</author>
</analytic>
<series>
<title level="j">International Journal of Biological Sciences</title>
<idno type="eISSN">1449-2288</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>The planted (
<italic>l</italic>
,
<italic>d</italic>
) motif search is one of the most widely studied problems in bioinformatics, which plays an important role in the identification of transcription factor binding sites in DNA sequences. However, it is still a challenging task to identify highly degenerate motifs, since current algorithms either output the exact results with a high computational cost or accomplish the computation in a short time but very often fall into a local optimum. In order to make a better trade-off between accuracy and efficiency, we propose a new pattern-driven algorithm, named PairMotif+. At first, some pairs of
<italic>l</italic>
-mers are extracted from input sequences according to probabilistic analysis and statistical method so that one or more pairs of motif instances are included in them. Then an approximate strategy for refining pairs of
<italic>l</italic>
-mers with high accuracy is adopted in order to avoid the verification of most candidate motifs. Experimental results on the simulated data show that PairMotif+ can solve various (
<italic>l, d</italic>
) problems within an hour on a PC with 2.67 GHz processor, and has a better identification accuracy than the compared algorithms MEME, AlignACE and VINE. Also, the validity of the proposed algorithm is tested on multiple real data sets.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Buhler, J" uniqKey="Buhler J">J Buhler</name>
</author>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Evans, Pa" uniqKey="Evans P">PA Evans</name>
</author>
<author>
<name sortKey="Smith, Ad" uniqKey="Smith A">AD Smith</name>
</author>
<author>
<name sortKey="Wareham, Ht" uniqKey="Wareham H">HT Wareham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ho, Es" uniqKey="Ho E">ES Ho</name>
</author>
<author>
<name sortKey="Jakubowski, Cd" uniqKey="Jakubowski C">CD Jakubowski</name>
</author>
<author>
<name sortKey="Gunderson, Si" uniqKey="Gunderson S">SI Gunderson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, Hq" uniqKey="Sun H">HQ Sun</name>
</author>
<author>
<name sortKey="Low, Myh" uniqKey="Low M">MYH Low</name>
</author>
<author>
<name sortKey="Hsu, Wj" uniqKey="Hsu W">WJ Hsu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
<author>
<name sortKey="Williams, N" uniqKey="Williams N">N Williams</name>
</author>
<author>
<name sortKey="Misleh, C" uniqKey="Misleh C">C Misleh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lawrence, Ce" uniqKey="Lawrence C">CE Lawrence</name>
</author>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Boguski, Ms" uniqKey="Boguski M">MS Boguski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neuwald, Af" uniqKey="Neuwald A">AF Neuwald</name>
</author>
<author>
<name sortKey="Liu, Js" uniqKey="Liu J">JS Liu</name>
</author>
<author>
<name sortKey="Lawrence, Ce" uniqKey="Lawrence C">CE Lawrence</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hughes, Jd" uniqKey="Hughes J">JD Hughes</name>
</author>
<author>
<name sortKey="Estep, Pw" uniqKey="Estep P">PW Estep</name>
</author>
<author>
<name sortKey="Tavazoie, S" uniqKey="Tavazoie S">S Tavazoie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thijs, G" uniqKey="Thijs G">G Thijs</name>
</author>
<author>
<name sortKey="Lescot, M" uniqKey="Lescot M">M Lescot</name>
</author>
<author>
<name sortKey="Marchal, K" uniqKey="Marchal K">K Marchal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tang, Me" uniqKey="Tang M">ME Tang</name>
</author>
<author>
<name sortKey="Krogh, A" uniqKey="Krogh A">A Krogh</name>
</author>
<author>
<name sortKey="Winther, O" uniqKey="Winther O">O Winther</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, N" uniqKey="Kim N">N Kim</name>
</author>
<author>
<name sortKey="Tharakaraman, K" uniqKey="Tharakaraman K">K Tharakaraman</name>
</author>
<author>
<name sortKey="Mari O Ramirez, L" uniqKey="Mari O Ramirez L">L Mariño-Ramírez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Ak" uniqKey="Miller A">AK Miller</name>
</author>
<author>
<name sortKey="Print, Cg" uniqKey="Print C">CG Print</name>
</author>
<author>
<name sortKey="Nielsen, Pmf" uniqKey="Nielsen P">PMF Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jajamovich, Gh" uniqKey="Jajamovich G">GH Jajamovich</name>
</author>
<author>
<name sortKey="Wang, Xd" uniqKey="Wang X">XD Wang</name>
</author>
<author>
<name sortKey="Arkin, Ap" uniqKey="Arkin A">AP Arkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fratkin, E" uniqKey="Fratkin E">E Fratkin</name>
</author>
<author>
<name sortKey="Naughton, Bt" uniqKey="Naughton B">BT Naughton</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Boucher, C" uniqKey="Boucher C">C Boucher</name>
</author>
<author>
<name sortKey="King, J" uniqKey="King J">J King</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, Cw" uniqKey="Huang C">CW Huang</name>
</author>
<author>
<name sortKey="Lee, Ws" uniqKey="Lee W">WS Lee</name>
</author>
<author>
<name sortKey="Hsieh, Sy" uniqKey="Hsieh S">SY Hsieh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, Nc" uniqKey="Jones N">NC Jones</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Rajapakse, Jc" uniqKey="Rajapakse J">JC Rajapakse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vanet, A" uniqKey="Vanet A">A Vanet</name>
</author>
<author>
<name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author>
<name sortKey="Labigne, A" uniqKey="Labigne A">A Labigne</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marsan, L" uniqKey="Marsan L">L Marsan</name>
</author>
<author>
<name sortKey="Sagot, M" uniqKey="Sagot M">M Sagot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pavesi, G" uniqKey="Pavesi G">G Pavesi</name>
</author>
<author>
<name sortKey="Mauri, G" uniqKey="Mauri G">G Mauri</name>
</author>
<author>
<name sortKey="Pesole, G" uniqKey="Pesole G">G Pesole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E Eskin</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Huang, Ch" uniqKey="Huang C">CH Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Balla, S" uniqKey="Balla S">S Balla</name>
</author>
<author>
<name sortKey="Huang, Ch" uniqKey="Huang C">CH Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dinh, H" uniqKey="Dinh H">H Dinh</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Kundeti, Vk" uniqKey="Kundeti V">VK Kundeti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dinh, H" uniqKey="Dinh H">H Dinh</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S Rajasekaran</name>
</author>
<author>
<name sortKey="Davila, J" uniqKey="Davila J">J Davila</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Q" uniqKey="Yu Q">Q Yu</name>
</author>
<author>
<name sortKey="Huo, Hw" uniqKey="Huo H">HW Huo</name>
</author>
<author>
<name sortKey="Zhang, Yp" uniqKey="Zhang Y">YP Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M Tompa</name>
</author>
<author>
<name sortKey="Li, N" uniqKey="Li N">N Li</name>
</author>
<author>
<name sortKey="Bailey, Tl" uniqKey="Bailey T">TL Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, J" uniqKey="Hu J">J Hu</name>
</author>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Kihara, D" uniqKey="Kihara D">D Kihara</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hertz, Gz" uniqKey="Hertz G">GZ Hertz</name>
</author>
<author>
<name sortKey="Hartzell, Gw" uniqKey="Hartzell G">GW Hartzell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
<author>
<name sortKey="Hartzell, Gw" uniqKey="Hartzell G">GW Hartzell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Crooks, Ge" uniqKey="Crooks G">GE Crooks</name>
</author>
<author>
<name sortKey="Hon, G" uniqKey="Hon G">G Hon</name>
</author>
<author>
<name sortKey="Chandonia, Jm" uniqKey="Chandonia J">JM Chandonia</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Int J Biol Sci</journal-id>
<journal-id journal-id-type="iso-abbrev">Int. J. Biol. Sci</journal-id>
<journal-id journal-id-type="publisher-id">ijbs</journal-id>
<journal-title-group>
<journal-title>International Journal of Biological Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1449-2288</issn>
<publisher>
<publisher-name>Ivyspring International Publisher</publisher-name>
<publisher-loc>Sydney</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23678291</article-id>
<article-id pub-id-type="pmc">3654438</article-id>
<article-id pub-id-type="doi">10.7150/ijbs.5786</article-id>
<article-id pub-id-type="publisher-id">ijbsv09p0412</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Paper</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>PairMotif+: A Fast and Effective Algorithm for De Novo Motif Discovery in DNA sequences</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Yu</surname>
<given-names>Qiang</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Huo</surname>
<given-names>Hongwei</given-names>
</name>
<xref ref-type="corresp" rid="FNA_envelop"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Yipu</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Guo</surname>
<given-names>Hongzhi</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Guo</surname>
<given-names>Haitao</given-names>
</name>
</contrib>
</contrib-group>
<aff>School of Computer Science and Technology, Xidian University, Xi'an, 710071, China.</aff>
<author-notes>
<corresp id="FNA_envelop">✉ Corresponding author: Prof. Hongwei Huo, School of Computer Science and Technology, Xidian University, Xi'an, 710071, China. Phone: 86-29-88201489, Fax: 86-29-88201489, E-mail:
<email>hwhuo@mail.xidian.edu.cn</email>
.</corresp>
<fn fn-type="conflict">
<p>Conflict of Interest: All authors have declared that no conflicts of interest exist and agree with the contents of the manuscript for publication.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>29</day>
<month>4</month>
<year>2013</year>
</pub-date>
<volume>9</volume>
<issue>4</issue>
<fpage>412</fpage>
<lpage>424</lpage>
<history>
<date date-type="received">
<day>1</day>
<month>1</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>4</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>© Ivyspring International Publisher. This is an open-access article distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by-nc-nd/3.0/). Reproduction is permitted for personal, noncommercial use, provided that the article is in whole, unmodified, and properly cited.</copyright-statement>
<copyright-year>2013</copyright-year>
</permissions>
<abstract>
<p>The planted (
<italic>l</italic>
,
<italic>d</italic>
) motif search is one of the most widely studied problems in bioinformatics, which plays an important role in the identification of transcription factor binding sites in DNA sequences. However, it is still a challenging task to identify highly degenerate motifs, since current algorithms either output the exact results with a high computational cost or accomplish the computation in a short time but very often fall into a local optimum. In order to make a better trade-off between accuracy and efficiency, we propose a new pattern-driven algorithm, named PairMotif+. At first, some pairs of
<italic>l</italic>
-mers are extracted from input sequences according to probabilistic analysis and statistical method so that one or more pairs of motif instances are included in them. Then an approximate strategy for refining pairs of
<italic>l</italic>
-mers with high accuracy is adopted in order to avoid the verification of most candidate motifs. Experimental results on the simulated data show that PairMotif+ can solve various (
<italic>l, d</italic>
) problems within an hour on a PC with 2.67 GHz processor, and has a better identification accuracy than the compared algorithms MEME, AlignACE and VINE. Also, the validity of the proposed algorithm is tested on multiple real data sets.</p>
</abstract>
<kwd-group>
<kwd>Motif search</kwd>
<kwd>Transcription factor binding sites</kwd>
<kwd>Pattern-driven algorithms.</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>Introduction</title>
<p>DNA motifs refer to short DNA segments that are regulatory elements bound by proteins such as transcription factors. Motif discovery is to find the unknown motifs in the given sequences, which plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. The planted (
<italic>l</italic>
,
<italic>d</italic>
) motif search (PMS)
<xref ref-type="bibr" rid="B1">1</xref>
, which is raised from this research, has become a widely accepted motif search problem formulation.</p>
<p>
<bold>Problem Definition.</bold>
Given a set of
<italic>n</italic>
-length sequences
<italic>S </italic>
= {
<italic>s</italic>
<sub>l</sub>
,
<italic>s</italic>
<sub>2</sub>
, … ,
<italic>s
<sub>t</sub>
</italic>
} over the alphabet {A, G, C, T} and nonnegative integers
<italic>l</italic>
and
<italic>d</italic>
, satisfying 0 ≤
<italic>d </italic>
<
<italic>l </italic>
<
<italic>n</italic>
. The PMS problem is to find an
<italic>l</italic>
-mer (i.e., an
<italic>l</italic>
-length string)
<italic>m</italic>
, such that each sequence
<italic>s
<sub>i</sub>
</italic>
contains an
<italic>l</italic>
-mer
<italic>m
<sub>i</sub>
</italic>
differing from
<italic>m</italic>
in at most
<italic>d</italic>
positions. The
<italic>l</italic>
-mer
<italic>m</italic>
is called an (
<italic>l</italic>
,
<italic>d</italic>
) motif and each
<italic>m
<sub>i</sub>
</italic>
is called an instance of
<italic>m</italic>
.</p>
<p>The PMS problem is NP-hard
<xref ref-type="bibr" rid="B2">2</xref>
. With
<italic>t</italic>
and
<italic>n</italic>
fixed, different values of
<italic>l</italic>
and
<italic>d</italic>
form various PMS instances. Usually, the motif length
<italic>l</italic>
is 5 to 25 basepairs (bps). For a given motif length
<italic>l</italic>
, the larger the degenerate positions
<italic>d</italic>
, the more difficult it is to identify the planted (
<italic>l</italic>
,
<italic>d</italic>
) motif in input sequences. Specifically, some researchers
<xref ref-type="bibr" rid="B3">3</xref>
,
<xref ref-type="bibr" rid="B4">4</xref>
use 2
<italic>d</italic>
-neighborhood probability (i.e., the probability that the Hamming distance between two random
<italic>l</italic>
-mers is not larger than 2
<italic>d</italic>
) to measure the difficulty of solving different PMS instances, since it is a good indicator to reflect the degree of degeneracy of PMS instances
<xref ref-type="bibr" rid="B3">3</xref>
.</p>
<p>The algorithms for PMS are categorized as approximate and exact depending on whether they are guaranteed to find the optimal motif always or not
<xref ref-type="bibr" rid="B5">5</xref>
. The approximate algorithms, which commonly model motifs using position weight matrix (PWM), can report results in a short time but not guarantee a global optimum. Most approximate recognition algorithms use potent statistical techniques. For example, the most popular algorithms MEME
<xref ref-type="bibr" rid="B6">6</xref>
and Gibbs Sampling
<xref ref-type="bibr" rid="B7">7</xref>
adopt Expectation Maximization (EM) and Gibbs sampling techniques, respectively. Based on MEME, there are the extension algorithms, like PROJECTION
<xref ref-type="bibr" rid="B1">1</xref>
and GADEM
<xref ref-type="bibr" rid="B8">8</xref>
. PROJECTION partitions all
<italic>l</italic>
-mers in
<italic>S</italic>
into many buckets and selects some valid buckets that contain several occurrences of the desired motif and little else, in order to provide a good initial state for the EM refinement. GADEM employs a genetic algorithm with an embedded EM algorithm to improve initial PWMs. The modification of Gibbs Sampling is described in
<xref ref-type="bibr" rid="B9">9</xref>
-
<xref ref-type="bibr" rid="B11">11</xref>
. In recent years, Bayesian theory has also been introduced in the field of motif search, such as BayesMD
<xref ref-type="bibr" rid="B12">12</xref>
, A-GLAM
<xref ref-type="bibr" rid="B13">13</xref>
, SBaSeTraM
<xref ref-type="bibr" rid="B14">14</xref>
and BAMBI
<xref ref-type="bibr" rid="B15">15</xref>
. Besides the statistical methods, some graph-theoretic methods either based on clustering or on heuristic search are proposed to solve the motif search problem, such as MotifCut
<xref ref-type="bibr" rid="B16">16</xref>
, sMCL-WMR
<xref ref-type="bibr" rid="B17">17</xref>
and Vine
<xref ref-type="bibr" rid="B18">18</xref>
. In the associated graphical model, each node corresponds to an
<italic>l</italic>
-mer in input sequences and each edge represents the similarity between the two
<italic>l</italic>
-mers it connects.</p>
<p>Exact recognition algorithms, which use consensus to represent motifs, find all (
<italic>l</italic>
,
<italic>d</italic>
) motifs and the optimal one by traversing the whole search space. Since all exact algorithms produce the consistent results
<xref ref-type="bibr" rid="B19">19</xref>
, the main concern on them is the time performance. Based on the graphical model of motif search, some exact algorithms, such as DPCFG
<xref ref-type="bibr" rid="B20">20</xref>
and RecMotif
<xref ref-type="bibr" rid="B4">4</xref>
, find all maximum cliques in the graph, with a search space of
<italic>O</italic>
(
<italic>n
<sup>t</sup>
</italic>
). The time performance of these algorithms does not depend on the motif length, but it is difficult for these algorithms to identify highly degenerate motifs because the associated graphs are so dense that numerous cliques need to be verified. There are some other exact algorithms based on pattern-driven. They verify all string patterns of length
<italic>l</italic>
, and output the patterns that occur in all input sequences with at most
<italic>d </italic>
mutations. The initial search space of these algorithms,
<italic>O</italic>
(4
<italic>
<sup>l</sup>
</italic>
), is much smaller than
<italic>O</italic>
(
<italic>n
<sup>t</sup>
</italic>
). Therefore, the recent research of exact recognition algorithms mainly concentrates on the pattern-driven algorithms, including the series of suffix tree algorithms
<xref ref-type="bibr" rid="B21">21</xref>
-
<xref ref-type="bibr" rid="B24">24</xref>
and the series of PMS algorithms
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B25">25</xref>
-
<xref ref-type="bibr" rid="B28">28</xref>
. Pattern-driven algorithms are good at finding motifs of length smaller than 20 bps, but their time overhead or space requirement will become unrealistic with the increase of the motif length.</p>
<p>Although many recognition algorithms have been proposed to solve the PMS problem, few of them can make a good trade-off between accuracy and time performance. In identifying highly degenerate motifs, they either output the exact results with a high computational cost or accomplish the computation in a short time but have a low accuracy. Thus, it is a meaningful work to design an algorithm that can get high accuracy results within a reasonable time, e.g., an hour on personal computers. To achieve this goal, we propose a new algorithm named PairMotif+, by designing the approximate version of PairMotif
<xref ref-type="bibr" rid="B29">29</xref>
. PairMotif, our recent exact algorithm, adopts the following idea: select multiple pairs of
<italic>l</italic>
-mers in which there must exist a pair of motif instances, by traversing reference sequences; then, refine each pair of
<italic>l</italic>
-mers, namely verify whether each
<italic>d</italic>
-neighbor of the pair of
<italic>l</italic>
-mers is a valid (
<italic>l</italic>
,
<italic>d</italic>
) motif. Obviously, the time performance of PairMotif depends on two values: the number of the selected pairs of
<italic>l</italic>
-mers and the number of the candidate motifs generated from each pair of
<italic>l</italic>
-mers.</p>
<p>The main idea of PairMotif+ is to reduce the above two values that determine the time performance. PairMotif+ consists of three steps. First, extract some pairs of
<italic>l</italic>
-mers from input sequences according to probabilistic analysis so that more than half of the pairs of motif instances are included in them. Second, analyze the weights of the extracted pairs of
<italic>l</italic>
-mers and filter out the pairs with small weights, ensuring at least one pair of motif instances are included in the remaining pairs of
<italic>l</italic>
-mers. Third, in refining each pairs of
<italic>l</italic>
-mers, use an approximate strategy with high accuracy to avoid the verification of most candidate motifs. Experimental results show that PairMotif+ can solve various PMS instances within an hour on a PC with 2.67 GHz processor, and outperforms the competition in identification accuracy.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Foundations</title>
<p>In our recent work
<xref ref-type="bibr" rid="B29">29</xref>
, we discussed how to partition and traverse the
<italic>d</italic>
-neighbors (candidate motifs) shared by a pair of
<italic>l</italic>
-mers, which is the basic of refining pairs of
<italic>l</italic>
-mers in this paper. This section provides a brief description of the partition and traversing methods.</p>
<p>
<bold>Definition 1.</bold>
Given a pair of
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
, the common
<italic>d</italic>
-neighbors (candidate motifs),
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), is defined to be {
<italic>y</italic>
: |
<italic>y</italic>
| =
<italic>l</italic>
,
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>1</sub>
) ≤
<italic>d</italic>
and
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>2</sub>
) ≤
<italic>d</italic>
}, where
<italic>d
<sub>H</sub>
</italic>
(
<bold>·</bold>
) denotes the Hamming distance between two
<italic>l</italic>
-mers.</p>
<p>
<bold>Definition 2.</bold>
Given a pair of
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
and another
<italic>l</italic>
-mer
<italic>y</italic>
, the
<italic>l</italic>
positions in the alignment of these three
<italic>l</italic>
-mers can be divided into four categories:
<italic>P</italic>
<sub>00</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
),
<italic>P</italic>
<sub>01</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
),
<italic>P</italic>
<sub>10</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
) and
<italic>P</italic>
<sub>11</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
). For each position
<italic>i</italic>
(1 ≤
<italic>i </italic>
<italic>l</italic>
), assume that it belongs to
<italic>P
<sub>ab</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
). Then,
<italic>a</italic>
is 1 if
<italic>x</italic>
<sub>1</sub>
[
<italic>i</italic>
] =
<italic>x</italic>
<sub>2</sub>
[
<italic>i</italic>
], 0 otherwise;
<italic>b</italic>
is 1 if either
<italic>y</italic>
[
<italic>i</italic>
] =
<italic>x</italic>
<sub>1</sub>
[
<italic>i</italic>
] or
<italic>y</italic>
[
<italic>i</italic>
] =
<italic>x</italic>
<sub>2</sub>
[
<italic>i</italic>
], 0 otherwise. Fig.
<xref ref-type="fig" rid="F1">1</xref>
shows an example for partitioning the positions in the alignment of three
<italic>l</italic>
-mers.</p>
<p>
<bold>Definition 3.</bold>
Given a pair of
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
and another
<italic>l</italic>
-mer
<italic>y </italic>
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), the mapping relation from
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
to
<italic>y</italic>
,
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
), is defined to be a 2-tuple <|
<italic>P</italic>
<sub>10</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
)|, |
<italic>P</italic>
<sub>00</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
)|>. Furthermore, the mapping relation from
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2 </sub>
to
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
),
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), is defined to be</p>
<disp-formula id="E1">
<inline-graphic xlink:href="ijbsv09p0412g02.jpg" mimetype="image"></inline-graphic>
<label>(1)</label>
</disp-formula>
<p>Given a pair of
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
, the elements in
<italic> R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) implies the approach to partitioning and traversing the candidate motif set
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
). We first discuss how to compute
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
). For any candidate motif
<italic>y</italic>
in
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), let
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
) = <
<italic>α</italic>
,
<italic>β</italic>
>. From Definition 2 and 3,
<italic>α</italic>
represents the number of positions at which
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] =
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
],
<italic>y</italic>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] and
<italic>y</italic>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
];
<italic>β</italic>
represents the number of positions at which
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
],
<italic>y</italic>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] and
<italic>y</italic>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
]. Thus, we have 0 ≤
<italic>α </italic>
<italic>l</italic>
-
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), 0 ≤
<italic>β </italic>
<italic> d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) and
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>1</sub>
) +
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>2</sub>
) = 2
<italic>α</italic>
+ 2
<italic>β </italic>
+ (
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) -
<italic>β</italic>
). Furthermore, we have
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>1</sub>
) +
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>2</sub>
) ≤ 2
<italic>d</italic>
because
<italic>y</italic>
is the
<italic>d</italic>
-neighbor of both
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
. Based on these considerations, we obtain inequalities (2). Obviously, the values of
<italic>α</italic>
and
<italic>β </italic>
are determined by
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), and
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) can be calculated by listing all 2-tuples <
<italic>α</italic>
,
<italic>β</italic>
> satisfying (2). For example, for the PMS instance (15, 4),
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = {<0, 0>, <0, 1>, <0, 2>, <0, 3>, <0, 4>, <1, 0>, <1, 1>, <1, 2>, <2, 0>} when
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = 4.</p>
<disp-formula id="E2">
<inline-graphic xlink:href="ijbsv09p0412g03.jpg" mimetype="image"></inline-graphic>
<label>(2)</label>
</disp-formula>
<p>Based on the different 2-tuples in
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), the candidate motif set
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) can be partitioned to |
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)| mutually disjoint subsets. For each <
<italic>α</italic>
,
<italic>β</italic>
> in
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), the corresponding subset of
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is denoted by
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), namely
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = {
<italic>y</italic>
:
<italic>y</italic>
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) and
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
,
<italic>y</italic>
) = <
<italic>α</italic>
,
<italic>β</italic>
>}. Assume that <
<italic>α</italic>
,
<italic>β</italic>
> and <
<italic>α</italic>
',
<italic>β</italic>
'> are two different elements of
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), then we have
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ∩
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
',
<italic>β</italic>
'></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) =
<italic>Ф </italic>
according to Definition 3. Since
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) represents the mapping relation from
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2 </sub>
to all candidate motifs, the partition of
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is:</p>
<disp-formula id="E3">
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = {
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
): <
<italic>α</italic>
,
<italic>β</italic>
>∈
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)}
<label>(3)</label>
</disp-formula>
<p>In terms of equation (3), we can traverse the candidate motifs derived from
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
, by generating the mutually disjoint subsets of
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) one by one. For each <
<italic>α</italic>
,
<italic>β</italic>
> in
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), the candidate motifs in
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) are generated as follows. First, set the initial candidate motif
<italic>y</italic>
as
<italic>x</italic>
<sub>2</sub>
. Second, select
<italic>α </italic>
positions from the positions at which
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] =
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
], and for each of these
<italic>α </italic>
positions, change
<italic>y</italic>
[
<bold>·</bold>
] to one of the three characters different from
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
]. Third, select
<italic>β </italic>
positions from the positions at which
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
], and for each of these
<italic>β </italic>
positions, change
<italic>y</italic>
[
<bold>·</bold>
] to one of the two characters different from
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] and
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
]. Fourth, select a part of positions from the positions at which
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] ≠
<italic>x</italic>
<sub>2</sub>
[
<bold>·</bold>
] except for those selected in the previous step, and change
<italic>y</italic>
[
<bold>·</bold>
] to
<italic>x</italic>
<sub>1</sub>
[
<bold>·</bold>
] for each of these positions. More details about these steps can be found in the reference
<xref ref-type="bibr" rid="B29">29</xref>
. According to the process of generating candidate motifs, the size of
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is calculated by (4) where
<italic>d
<sub>H</sub>
</italic>
denotes the Hamming distance between
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
.</p>
<disp-formula id="E4">
<inline-graphic xlink:href="ijbsv09p0412g04.jpg" mimetype="image"></inline-graphic>
<label>(4)</label>
</disp-formula>
</sec>
<sec>
<title>Step 1: Extracting Pairs of l-mers</title>
<p>PairMotif+ only extracts the pair of
<italic>l</italic>
-mers that contains two
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
coming from different sequences, i.e.,
<italic>x</italic>
<sub>1</sub>
<italic>
<sub>l</sub>
s
<sub>i</sub>
</italic>
,
<italic>x</italic>
<sub>2 </sub>
<italic>
<sub>l</sub>
s
<sub>j</sub>
</italic>
and
<italic>i</italic>
<italic>j</italic>
. Thus, the pair of
<italic>l</italic>
-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
can be denoted by (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) if
<italic>i</italic>
<
<italic>j</italic>
, (
<italic>x</italic>
<sub>2</sub>
,
<italic>x</italic>
<sub>1</sub>
) otherwise. The set of all pairs of
<italic>l</italic>
-mers in input sequences
<italic>S</italic>
is denoted by
<italic>L = </italic>
{(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
): ("
<italic>i</italic>
,
<italic>j</italic>
)(1 ≤
<italic>i</italic>
<
<italic>j</italic>
<italic>t</italic>
,
<italic>x</italic>
<sub>1</sub>
<italic>
<sub>l</sub>
s
<sub>i</sub>
</italic>
and
<italic>x</italic>
<sub>2</sub>
<italic>
<sub>l</sub>
s
<sub>j</sub>
</italic>
)}.</p>
<p>The aim of Step 1 is to extract as few pairs of
<italic>l</italic>
-mers as possible from
<italic>L</italic>
, and ensure that sufficient (more than half of) pairs of motif instances are included in them. We set a threshold
<italic>k</italic>
(0 ≤
<italic>k </italic>
<italic>l</italic>
), and then extract the pairs of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) from
<italic>L</italic>
with
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
. The set of the extracted pairs of
<italic>l</italic>
-mers is denoted by
<italic>L</italic>
<sub>1</sub>
<italic>=</italic>
{(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
): (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)∈
<italic>L</italic>
and
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
}.</p>
<p>For a proper choice of the threshold
<italic>k</italic>
, we consider two probabilities. One is the probability that the Hamming distance between two random
<italic>l</italic>
-mers is less than or equal to
<italic>k</italic>
, denoted by
<italic>p
<sub>k</sub>
</italic>
. The other is the probability that the Hamming distance between two randomly selected motif instances is less than or equal to
<italic>k</italic>
, denoted by
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
. The probability
<italic>p
<sub>k</sub>
</italic>
is calculated by (5).</p>
<disp-formula id="E5">
<inline-graphic xlink:href="ijbsv09p0412g05.jpg" mimetype="image"></inline-graphic>
<label>(5)</label>
</disp-formula>
<p>In order to calculate
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
, given a motif
<italic>m</italic>
and a motif instance
<italic>m</italic>
', more specific distance relation between
<italic>m</italic>
and
<italic>m</italic>
' is required besides 0 ≤
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
') ≤
<italic>d</italic>
. We determine this relation by assuming that
<italic>m</italic>
' is obtained from
<italic>m</italic>
as follows: select
<italic>d </italic>
positions in
<italic>m</italic>
at random, and then replace each character at the selected positions with a random character in {A, G, C, T}. Since each of the selected
<italic>d</italic>
positions is changed with probability 3/4, the expectation of the distance between
<italic>m</italic>
and
<italic>m</italic>
' is 3
<italic>d</italic>
/4, the rationality of which will be analyzed in Discussion and Conclusions. On this basis, the probability that the Hamming distance between
<italic>m</italic>
and
<italic>m</italic>
' is
<italic>k</italic>
(0 ≤
<italic>k</italic>
<italic>d</italic>
) can be calculated by (6).</p>
<disp-formula id="E6">
<inline-graphic xlink:href="ijbsv09p0412g06.jpg" mimetype="image"></inline-graphic>
<label>(6)</label>
</disp-formula>
<p>Let
<italic>m</italic>
<sub>1</sub>
and
<italic>m</italic>
<sub>2</sub>
be two randomly selected instances of the motif
<italic>m</italic>
. <
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
),
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
)> represents the distance between
<italic>m</italic>
and the pair of motif instances (
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
), corresponding to a sample space
<italic>Ω</italic>
= {<
<italic>i</italic>
,
<italic>j</italic>
>: 0 ≤
<italic>i</italic>
<italic>d</italic>
, 0 ≤
<italic>j</italic>
<italic>d</italic>
}. Let
<italic>P</italic>
(<
<italic>i</italic>
,
<italic>j</italic>
>) denote the probability of <
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
),
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
)> = <
<italic>i</italic>
,
<italic>j</italic>
>. As
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
) =
<italic>i</italic>
and
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
) =
<italic>j</italic>
are independent with each other, we have:</p>
<disp-formula id="E7">
<italic>P</italic>
(<
<italic>i</italic>
,
<italic>j</italic>
>) =
<italic>P</italic>
(
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
) =
<italic>i</italic>
) ×
<italic>P</italic>
(
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
) =
<italic>j</italic>
)
<label>(7)</label>
</disp-formula>
<p>Based on the above equations,
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
can be calculated using the Theorem of Total Probability:</p>
<disp-formula id="E8">
<inline-graphic xlink:href="ijbsv09p0412g07.jpg" mimetype="image"></inline-graphic>
<label>(8)</label>
</disp-formula>
<p>In (8),
<italic>P</italic>
(
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
) ≤
<italic>k | </italic>
<
<italic>i</italic>
,
<italic>j</italic>
>) represents the probability of
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
given <
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
),
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
)> = <
<italic>i</italic>
,
<italic>j</italic>
>. Its value can be calculated according to the actual situation. For example, for the PMS instance (15, 4),
<italic>P</italic>
(
<italic>d
<sub>H</sub>
</italic>
(
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
) ≤ 4
<italic> | </italic>
<2, 2>) = 1, and</p>
<disp-formula>
<inline-graphic xlink:href="ijbsv09p0412g08.jpg" mimetype="image"></inline-graphic>
</disp-formula>
<p>With
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
, it is easy to calculate the expectation of the number of extracted pairs of motif instances
<italic>E</italic>
[
<italic>N
<sub>m</sub>
</italic>
]. For the PMS problem, each input sequence contains a motif instance, so there are totally
<italic>t</italic>
(
<italic>t</italic>
-1)/2 pairs of motif instances. Moreover, in extracting pairs of
<italic>l</italic>
-mers with the restriction of the threshold
<italic>k</italic>
, each pair of motif instances has a probability of
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
to be extracted. Therefore,</p>
<disp-formula id="E9">
<italic>E</italic>
[
<italic>N
<sub>m</sub>
</italic>
] =
<italic>t</italic>
(
<italic>t</italic>
-1)/2 ×
<italic>p'
<sub>k</sub>
</italic>
<label>(9)</label>
</disp-formula>
<p>According to these two probabilities
<italic>p
<sub>k</sub>
</italic>
and
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
, we analyze how to set the threshold
<italic>k</italic>
for the given problem parameters
<italic>l</italic>
,
<italic>d</italic>
and
<italic>t</italic>
. Taking the PMS instance (15, 4) and
<italic>t</italic>
= 20 as an example, Table
<xref ref-type="table" rid="T1">1</xref>
gives the values of
<italic>p
<sub>k</sub>
</italic>
,
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
and
<italic>E</italic>
[
<italic>N
<sub>m</sub>
</italic>
] under different values of
<italic>k</italic>
. As mentioned above, we want to extract as few pairs of
<italic>l</italic>
-mers as possible while including sufficient pairs of motif instances. When
<italic>k</italic>
is 4, the value of
<italic>p
<sub>k</sub>
</italic>
is very small, which allows us to extract very few pairs of
<italic>l</italic>
-mers; however, the value of
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
is also so small that we cannot get sufficient pairs of motif instances. When
<italic>k</italic>
is 6 or a greater value, the value of
<italic>p</italic>
'
<italic>
<sub>k</sub>
</italic>
is large enough that more than 80% of pairs of motif instances are extracted, but the value of
<italic>p
<sub>k</sub>
</italic>
is also large, which is not conducive to reducing the scales of extracted pairs of
<italic>l</italic>
-mers. Therefore, it is appropriate to set
<italic>k</italic>
as 5 to perform extraction: on the one hand, only 0.08% of pairs of
<italic>l</italic>
-mers are extracted; on the other hand, nearly 60% of pairs of motif instances can be extracted. In doing so, we can not only reduce data scales greatly, but also retain sufficient motif information, providing a good foundation for the subsequent processing.</p>
</sec>
<sec>
<title>Step 2: Filtering Pairs of l-mers</title>
<p>This section discusses how to filter the extracted pairs of
<italic>l</italic>
-mers (i.e., the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
), in order to further reduce data scales and still retain one or more pairs of motif instances. Motifs often occur in sequences in a conservative form, and the similarity between motif instances is larger than that between most
<italic>l</italic>
-mers in background sequences. We define the weight of a pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) to be the similarity between (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) and other
<italic>l</italic>
-mers. On this basis, we analyze the weight distribution of the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
, and then filter out the pairs of
<italic>l</italic>
-mers whose weights are small.</p>
<p>Given an
<italic>l</italic>
-mer
<italic>x</italic>
, its weight
<italic>w</italic>
(
<italic>x</italic>
) is calculated by (10). The larger the weight of
<italic>x</italic>
, the higher the similarity between
<italic>x</italic>
and other
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
. In (10),
<italic>sim</italic>
(
<bold>·</bold>
) =
<italic> l </italic>
-
<italic>d
<sub>H</sub>
</italic>
(
<bold>·</bold>
), which represents the similarity between two
<italic>l</italic>
-mers.</p>
<disp-formula id="E10">
<inline-graphic xlink:href="ijbsv09p0412g09.jpg" mimetype="image"></inline-graphic>
<label>(10)</label>
</disp-formula>
<p>Based on (10), the weight of a pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
),
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), is calculated as follows:</p>
<disp-formula id="E11">
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) =
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
) +
<italic>w</italic>
(
<italic>x</italic>
<sub>2</sub>
)
<label>(11)</label>
</disp-formula>
<p>For the PMS instances (15, 4) and (18, 6), we sample both the pairs of
<italic>l</italic>
-mers (i.e., all elements in
<italic>L</italic>
<sub>1</sub>
, including both the background and motif information) and the pairs of motif instances from
<italic>L</italic>
<sub>1</sub>
. Then we observe their weight distribution. The sampling process is: first, randomly generate 20 sequences of length 600 with each of them implanted a random motif instance; second, extract pairs of
<italic>l</italic>
-mers to form
<italic>L</italic>
<sub>1</sub>
, with
<italic>k</italic>
= 5 and 6 for the instances (15, 4) and (18, 6), respectively; third, sample the pairs of
<italic>l</italic>
-mers and the pairs of motif instances from
<italic>L</italic>
<sub>1</sub>
. Fig.
<xref ref-type="fig" rid="F2">2</xref>
shows the weight distribution (histogram) of the sampling data, which is the average of 10 times random sampling.</p>
<p>In the weight distribution of the pairs of
<italic>l</italic>
-mers, the pairs of motif instances are at the area where the weight is large, both for the PMS instance (15, 4) and (18, 6). That is, for the pairs in
<italic>L</italic>
<sub>1</sub>
, almost all of the ones with small weights are the pairs of background
<italic>l</italic>
-mers, whereas the ones with large weights correspond to both the pairs of motif instances and the pairs of background
<italic>l</italic>
-mers. By observing the histogram of pairs of
<italic>l</italic>
-mers, we can find that their weights are approximately distributed as a normal distribution. Let μ and σ denote the mean and the standard deviation of the weights of pairs of
<italic>l</italic>
-mers, respectively. We filter the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1 </sub>
by making the remaining ones satisfy:</p>
<disp-formula id="E12">
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) > μ +
<italic>q</italic>
σ
<label>(12)</label>
</disp-formula>
<p>In (12), the parameter
<italic>q </italic>
(
<italic>q</italic>
≥ 0) indicates the filtering strength. To filter out more pairs of background
<italic>l</italic>
-mers, we should set
<italic>q</italic>
as high as possible and not remove all pairs of motif instances. For example, for the PMS instance (15, 4),
<italic>q</italic>
can be set to 4, and μ +
<italic>q</italic>
σ = 202 + 4 × 47 = 390. Then the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
with weight greater than 390 are retained and stored in the set
<italic>L</italic>
<sub>2</sub>
, namely
<italic>L</italic>
<sub>2 </sub>
<italic>= </italic>
{(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
): (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)∈
<italic>L</italic>
<sub>1</sub>
and
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) > μ +
<italic>q</italic>
σ}. Thus,
<italic>L</italic>
<sub>2</sub>
is composed of a (small) part of elements in
<italic>L</italic>
<sub>1</sub>
with a certain amount of motif information included.</p>
</sec>
<sec>
<title>Step 3: Refining Pairs of l-mers</title>
<p>The process of refining pairs of
<italic>l</italic>
-mers is to verify the candidate motifs derived from the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>2</sub>
one by one, and then report the motif with maximum score. Each candidate motif
<italic>y</italic>
is measured by Consensus score
<xref ref-type="bibr" rid="B19">19</xref>
:</p>
<disp-formula id="E13">
<inline-graphic xlink:href="ijbsv09p0412g10.jpg" mimetype="image"></inline-graphic>
<label>(13)</label>
</disp-formula>
<p>This section gives an approximate refinement strategy in order to avoid the verification of most candidate motifs with high accuracy. Specifically, in refining each pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), instead of generating all subsets in the partition of
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) as we did in our recent work
<xref ref-type="bibr" rid="B29">29</xref>
, we just generate a part of subsets according to probabilistic analysis.</p>
<p>
<bold>Observation 1.</bold>
Given an (
<italic>l</italic>
,
<italic>d</italic>
) motif
<italic>y</italic>
and its two instances
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
with
<italic>d
<sub>sum</sub>
</italic>
denoting
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>1</sub>
) +
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>2</sub>
), the value of
<italic>d
<sub>sum</sub>
</italic>
ranges from
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) to 2
<italic>d</italic>
. The first column of Table
<xref ref-type="table" rid="T2">2</xref>
gives all the possible values of
<italic>d
<sub>sum</sub>
</italic>
for the PMS instance (15, 4), when
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = 4.</p>
<p>Given a pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), taking the PMS instance (15, 4) and
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = 4 as an example, this table shows two values related to
<italic>d
<sub>sum</sub>
</italic>
, the basic to understand the approximate strategy for refining pairs of
<italic>l</italic>
-mers. One is the probability that a specific value of
<italic>d
<sub>sum</sub>
</italic>
occurs. The other is the number of candidate motifs generated from (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) under a specific value of
<italic>d
<sub>sum</sub>
</italic>
. Moreover, this table shows mutually disjunct subsets of
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), which are used to calculate the number of candidate motifs under different values of
<italic>d
<sub>sum</sub>
</italic>
. Note that,
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is divided as follows. As mentioned previously,
<italic>d
<sub>sum</sub>
</italic>
= 2
<italic>α</italic>
+
<italic>β </italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), and thus the first inequality in (2) can be converted to 2
<italic>d</italic>
-
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) + 1 equations by Observation 1: 2
<italic>α</italic>
+
<italic>β </italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) =
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
), … , 2
<italic>α</italic>
+
<italic>β </italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = 2
<italic>d</italic>
; then, each subset of
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is obtained by solving a different equation.</p>
<p>We introduce the approximate refinement strategy by considering two values related to
<italic>d
<sub>sum</sub>
</italic>
, which are given in the third and fourth column of Table
<xref ref-type="table" rid="T2">2</xref>
.
<bold>One</bold>
is the probability that a specific value of
<italic>d
<sub>sum</sub>
</italic>
occurs, equal to the sum of the probability of all the possible samples in
<italic>Ω</italic>
. For example, when
<italic>d
<sub>sum</sub>
</italic>
= 7, the possible samples in
<italic>Ω</italic>
are <3, 4> and <4, 3>, and the probability that
<italic>d
<sub>sum</sub>
</italic>
= 7 occurs is equal to
<italic>P</italic>
(<3, 4>) +
<italic>P</italic>
(<4, 3>). To facilitate the analysis, the occurrence probability of
<italic>d
<sub>sum </sub>
</italic>
under different values is normalized.
<bold>The other</bold>
is the number of candidate motifs generated from (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) under a specific value of
<italic>d
<sub>sum</sub>
</italic>
, which indicates the computational amount of refining (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) under the specific value of
<italic>d
<sub>sum</sub>
</italic>
. According to different values of
<italic>d
<sub>sum</sub>
</italic>
,
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) can be divided into 2
<italic>d</italic>
-
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) + 1 mutually disjunct subsets, as shown in the second column of Table
<xref ref-type="table" rid="T2">2</xref>
. Thus, the candidate motifs generated under a specific value of
<italic>d
<sub>sum</sub>
</italic>
are those generated in terms of the associated subset of
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
); accordingly, we can calculate the number of candidate motifs under the specific value of
<italic> d
<sub>sum</sub>
</italic>
by (4).</p>
<p>As shown in Table
<xref ref-type="table" rid="T2">2</xref>
, the number of generated candidate motifs grows dramatically with the increase of the value of
<italic>d
<sub>sum</sub>
</italic>
. When
<italic>d
<sub>sum</sub>
</italic>
= 8, its occurrence probability is 0.11, but the associated candidate motifs account for 66.7% of the total amount. That is, the probability of finding the correct solution is only 0.11, but 66.7% of total computation amount is required. On the contrary, if we verify the associated candidate motifs when
<italic>d
<sub>sum</sub>
</italic>
is from 4 to 6, we only use 9.2% of total computation amount to find the correct solution with probability 0.7.</p>
<p>Taking these considerations into account, we use the following approximate strategy to reduce computational amount: verify the candidate motifs corresponding to the
<italic>d
<sub>sum </sub>
</italic>
with small value first. If the motif is found, then end the discovery process; otherwise, verify the candidate motifs corresponding to the
<italic>d
<sub>sum</sub>
</italic>
with large values gradually. Specifically, we first verify the candidate motifs in the subsets of
<italic>M
<sub>d</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) that correspond to the 2-tuples <
<italic>α</italic>
,
<italic>β</italic>
> satisfying (14).</p>
<disp-formula id="E14">2
<italic>α</italic>
+
<italic>β</italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤ 3
<italic>d</italic>
/2
<label>(14)</label>
</disp-formula>
<p>In (14), 2
<italic>α</italic>
+
<italic>β </italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) represents
<italic>d
<sub>sum</sub>
</italic>
, and 3
<italic>d</italic>
/2 is the threshold set to calculate the 2-tuples <
<italic>α</italic>
,
<italic>β</italic>
>. There are two reasons why we use 3
<italic>d</italic>
/2 as the threshold. First, in terms of the description in Step 1, the expectation of the distance between a motif and its instance is 3
<italic>d</italic>
/4, so the expectation of
<italic>d
<sub>sum</sub>
</italic>
is 3
<italic>d</italic>
/2. Second, after the operation of Step 1 and 2, the motif instances in
<italic>L</italic>
<sub>2</sub>
have a relatively high similarity and they are close to the original motif, so the value of
<italic>d
<sub>sum</sub>
</italic>
that corresponds to motif instances is likely to be smaller than the expectation 3
<italic>d</italic>
/2.</p>
</sec>
<sec>
<title>PairMotif+</title>
<p>Based on the above three steps, the whole algorithm is described as follows:</p>
</sec>
<sec>
<title>Algorithm PairMotif+</title>
<p>
<bold>Input</bold>
:
<italic>l</italic>
,
<italic>d</italic>
,
<italic>S</italic>
= {
<italic>s</italic>
<sub>1</sub>
,
<italic>s</italic>
<sub>2</sub>
, … ,
<italic>s
<sub>t</sub>
</italic>
}</p>
<p>
<bold>Output</bold>
: a motif
<italic>m</italic>
</p>
<p>1:
<italic>L</italic>
<sub>1 </sub>
<italic>Ф</italic>
,
<italic>L</italic>
<sub>2 </sub>
<italic>Ф</italic>
,
<italic>maxScore </italic>
← 0, set the values of
<italic>k</italic>
and
<italic>q</italic>
</p>
<p>2:
<bold>for</bold>
each pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) in
<italic>S</italic>
<bold>do</bold>
</p>
<p>3:
<bold>if</bold>
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
<bold>then</bold>
add (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) to
<italic>L</italic>
<sub>1</sub>
</p>
<p>4: Calculate μ and σ for weights of all pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
</p>
<p>5:
<bold>for</bold>
each pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ∈
<italic>L</italic>
<sub>1 </sub>
<bold>do</bold>
</p>
<p>6:
<bold>if</bold>
<italic>w</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) > μ +
<italic>q</italic>
σ
<bold>then</bold>
add (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) to
<italic>L</italic>
<sub>2</sub>
</p>
<p>7:
<bold>for</bold>
each pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ∈
<italic>L</italic>
<sub>2 </sub>
<bold>do</bold>
</p>
<p>8:
<bold>for</bold>
each <
<italic>α</italic>
,
<italic>β</italic>
> ∈
<italic>R</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)
<bold>do</bold>
</p>
<p>9:
<bold>if</bold>
2
<italic>α</italic>
+
<italic>β </italic>
+
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤ 3
<italic>d</italic>
/2
<bold>then</bold>
</p>
<p>10:
<bold>for</bold>
each
<italic>y </italic>
<italic>M
<sub>d</sub>
</italic>
<sub><
<italic>α</italic>
,
<italic>β</italic>
></sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
)
<bold>do</bold>
</p>
<p>11:
<bold>if</bold>
<italic>score</italic>
(
<italic>y</italic>
) >
<italic>maxScore</italic>
<bold>then</bold>
</p>
<p>12:
<italic>m</italic>
<italic>y</italic>
,
<italic>maxScore</italic>
=
<italic>score</italic>
(
<italic>y</italic>
)</p>
<p>13: Output
<italic>m</italic>
</p>
<p>Line 1 carries out the initialization. The values of
<italic>k</italic>
and
<italic>q</italic>
, which depend largely on
<italic>l </italic>
and
<italic>d</italic>
, are set according to the probabilistic analysis and statistical method described in Step 1 and 2, and their values under different (
<italic>l</italic>
,
<italic>d</italic>
) instances will be given in Results section. Lines 2 - 3, which are Step 1, extract pairs of
<italic>l</italic>
-mers from input sequences
<italic>S</italic>
with the restriction of the threshold
<italic>k</italic>
and store them in
<italic>L</italic>
<sub>1</sub>
. Lines 4 - 6, which perform Step 2, filter the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
according to the filtering strength
<italic>q</italic>
with remaining ones stored in
<italic>L</italic>
<sub>2</sub>
. Lines 7 - 13, which correspond to Step 3, verify each candidate motif derived from the pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>2</sub>
and output the motif with maximum score.</p>
<p>The time complexity of PairMotif+ depends mainly on Step 3 (lines 7 - 13). First, let
<italic>N</italic>
denote the number of pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>2</sub>
, whose order of magnitude is 10
<sup>2</sup>
or 10
<sup>3</sup>
in terms of the analysis in Step 2 and our experimental verification; since
<italic>N</italic>
is approximately equal to the sequence length
<italic>n</italic>
, we replace
<italic>N</italic>
with
<italic>n</italic>
in the time complexity. Second, for each pair of
<italic>l</italic>
-mers (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) in
<italic>L</italic>
<sub>2</sub>
, the approximate refinement strategy makes the distance from a candidate motif to
<italic>x</italic>
<sub>1</sub>
or
<italic>x</italic>
<sub>2</sub>
usually less than or equal to 3
<italic>d</italic>
/4, and thus the probability that a random
<italic>l</italic>
-mer
<italic>y</italic>
becomes a candidate motif is Prob.[
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>1</sub>
) ≤ 3
<italic>d</italic>
/4 &
<italic>d
<sub>H</sub>
</italic>
(
<italic>y</italic>
,
<italic>x</italic>
<sub>2</sub>
) ≤ 3
<italic>d</italic>
/4] =
<italic>p</italic>
<sup>2</sup>
<italic>
<sub>3d/</sub>
</italic>
<sub>4</sub>
; furthermore, the number of candidate motifs derived from (
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) is approximately equal to 4
<italic>
<sup>l</sup>
p</italic>
<sup>2</sup>
<italic>
<sub>3d/</sub>
</italic>
<sub>4</sub>
, where 4
<italic>
<sup>l</sup>
</italic>
is the number of all possible
<italic>l</italic>
-mers. Third, verifying each candidate motif
<italic>y</italic>
is to compare
<italic>y</italic>
with
<italic>O</italic>
(
<italic>tn</italic>
)
<italic>l</italic>
-mers in input sequences. Therefore, the expected time complexity of PairMotif+ is
<italic>O</italic>
(
<italic>tn</italic>
<sup>2</sup>
4
<italic>
<sup>l</sup>
p</italic>
<sup>2</sup>
<italic>
<sub>3d/</sub>
</italic>
<sub>4</sub>
).</p>
<p>The memory usage of PairMotif+ will reach its peak when Step 1 is being processed; accordingly, the space complexity of PairMotif+ depends on the number of pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
. There are a total of
<italic>O</italic>
(
<italic>t</italic>
<sup>2</sup>
<italic>n</italic>
<sup>2</sup>
) pairs of
<italic>l</italic>
-mers in
<italic>S</italic>
and each pair has a probability of
<italic>p
<sub>k</sub>
</italic>
to be extracted, so the number of pairs of
<italic>l</italic>
-mers in
<italic>L</italic>
<sub>1</sub>
is
<italic>O</italic>
(
<italic>t</italic>
<sup>2</sup>
<italic>n</italic>
<sup>2</sup>
<italic>p
<sub>k</sub>
</italic>
). Note that, the memory usage in refining each pair of
<italic>l</italic>
-mers is negligible, because PairMotif+ does not generate the whole candidate motif set in Step 3. Actually, in traversing the candidate motif set, the algorithm verifies each candidate motif
<italic>y</italic>
immediately after
<italic>y</italic>
is generated, and then releases the associated storage space. Therefore, the space complexity of PairMotif+ is
<italic>O</italic>
(
<italic>t</italic>
<sup>2</sup>
<italic>n</italic>
<sup>2</sup>
<italic>p
<sub>k</sub>
</italic>
).</p>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>Test on Simulated Data</title>
<p>Simulated data provide quantitative measures to compare the performance of PairMotif+ with the existing algorithms. We generate the simulated data sets following
<xref ref-type="bibr" rid="B1">1</xref>
: generate a motif of length
<italic>l</italic>
and
<italic>t </italic>
identically distributed sequences of length
<italic>n</italic>
; then, for each sequence
<italic>s</italic>
, implant a random motif instance, which differs from the motif in at most
<italic>d</italic>
positions, to a random position in
<italic>s</italic>
. To evaluate the prediction accuracy, we use the nucleotide level performance coefficient (
<italic>nPC</italic>
) following
<xref ref-type="bibr" rid="B30">30</xref>
, namely |
<italic>K</italic>
<italic>P</italic>
|/|
<italic>K</italic>
<italic>P</italic>
|, where
<italic>K</italic>
is the set of nucleotide positions corresponding to motif occurrences and
<italic>P</italic>
is the set of predicted nucleotides positions.</p>
<p>Several representative algorithms are selected to compare with PairMotif+, including MEME
<xref ref-type="bibr" rid="B6">6</xref>
, AlignACE
<xref ref-type="bibr" rid="B10">10</xref>
, VINE
<xref ref-type="bibr" rid="B18">18</xref>
and PairMotif
<xref ref-type="bibr" rid="B29">29</xref>
. MEME and AlignACE are the most popular motif recognition algorithms based on PWM; they were also involved in a comparison of different recognition algorithms in the review articles
<xref ref-type="bibr" rid="B30">30</xref>
and
<xref ref-type="bibr" rid="B31">31</xref>
. Vine, a recent method, is a polynomial-time heuristic algorithm based on graphical model, outperforming widely used approximate algorithms on the simulated data. PairMotif is a fast exact algorithm, capable of reporting all (
<italic>l</italic>
,
<italic>d</italic>
) motifs; its prediction accuracy is obtained by evaluating the (
<italic>l</italic>
,
<italic>d</italic>
) motif with maximum score. All algorithms are performed in the same experimental environment with a 2.67 GHz processor and a 4 Gbyte memory. The experimental results are the average derived by executing algorithms on five simulated data sets.</p>
<p>
<bold>First</bold>
, the comparisons are carried out on different PMS instances with fixed
<italic>t</italic>
= 20 and
<italic>n</italic>
= 600. As described above, 2
<italic>d</italic>
-neighborhood probability (
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
), which can be calculated by (5), reflects the degree of degeneracy of a PMS instance. We select ten PMS instances with different value of
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
as follows:
<italic>l</italic>
is less than or equal to 25, conforming to the general motif length;
<italic>d</italic>
is selected by setting the value of
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
from 0.05 to 0.7, where 0.05 is approximately equal to the
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
value of the classical PMS instance (15, 4), and the upper bound 0.7 makes the (
<italic>l</italic>
,
<italic>d</italic>
) motifs degenerate enough so that there is a lot of background noise.</p>
<p>Table
<xref ref-type="table" rid="T3">3</xref>
gives the prediction accuracy and running time of compared algorithms on these selected PMS instances. In PairMotif+, the parameters
<italic>k</italic>
and
<italic>q</italic>
are set according to the specific (
<italic>l</italic>
,
<italic>d</italic>
) instance. The threshold
<italic>k</italic>
for extracting pairs of
<italic>l</italic>
-mers increases with the increase of
<italic>l</italic>
so that sufficient pairs of motif instances can be extracted. The filtering strength
<italic>q</italic>
is related to
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
; we decrease the value of
<italic>q</italic>
when
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
is larger than 0.25, and thus we can still retain a certain amount of pairs of motif instances in the strong interference case. For these PMS instances, PairMotif+ can solve each of them within an hour, and its perdition accuracy is better than that of the compared approximate algorithms (MEME, AlignACE and VINE) and close to that of the exact algorithm (PairMotif). Although the exact algorithm can achieve the optimal solution, its computational cost is unrealistic for the PMS instances with large
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
value. For example, in solving the instances (24, 8), (19, 7), (21, 8) and (23, 9), PairMotif requires a running time of more than five hours.</p>
<p>Among these tested PMS instances, (15, 5), (17, 6), (19, 7), (21, 8) and (23, 9) are challenging ones
<xref ref-type="bibr" rid="B1">1</xref>
. An instance is challenging if the input sequences are expected to contain one or more (
<italic>l</italic>
,
<italic>d</italic>
) motifs that occur by random chance. From the viewpoint of computational cost of exact algorithms, solving challenging instances with large
<italic>l</italic>
(
<italic>l</italic>
> 15) requires huge time overhead. PMS5
<xref ref-type="bibr" rid="B27">27</xref>
is an outstanding exact algorithm for solving challenging instances. Fig.
<xref ref-type="fig" rid="F3">3</xref>
shows the time overhead and prediction accuracy of PairMotif+ and PMS5 on these challenging instances. Compared with PMS5, PairMotif+ requires much less time overhead. Particularly, PairMotif+ can solve the instance (23, 9) within an hour, while the time overhead of PMS5 exceeds 40 hours. For the prediction accuracy, PairMotif+ shows a competitive performance: the accuracy of PairMotif+ is equal or close to that of PMS5 on different challenging instances. In the subsequent experiments, we no longer compare PairMotif+ with the exact algorithms.</p>
<p>
<bold>Second</bold>
, we carry out comparisons on different sequence length
<italic>n</italic>
by fixing the PMS instance as (15, 4) and
<italic>t</italic>
= 20. Fig.
<xref ref-type="fig" rid="F4">4</xref>
plots the prediction accuracy of compared algorithms against the increase of
<italic>n</italic>
, where
<italic>n</italic>
is from 200 to 2000. All the algorithms show the trend toward degradation in prediction accuracy, since the signal strength of motifs decreases gradually with the increase of
<italic>n</italic>
. In spite of this fact, we can find that PairMotif+ performs better than other algorithms: (1) PairMotif+ outperforms other algorithms on the whole prediction accuracy; (2) The prediction accuracy of PairMotif+ is relatively stable and decreases slowly, while all the other algorithms show a sharp decline in some cases, especially when
<italic>n</italic>
= 2000.</p>
<p>
<bold>Third</bold>
, the algorithms are compared on different number of planted motif instances, with fixed PMS instance (15, 4),
<italic>t</italic>
= 20 and
<italic>n</italic>
= 600. In reality, there may not exist motif instances in some input sequences, which increases the problem difficulty. To simulate this case, we only select a part of input sequences randomly with each of them implanted a motif instance. In this way, the number of selected sequences is equal to the total number of planted motif instances. The smaller the number of planted motif instances, the more difficult it is to discover the planted (
<italic>l</italic>
,
<italic>d</italic>
) motif. Fig.
<xref ref-type="fig" rid="F5">5</xref>
shows the prediction accuracy of compared algorithms by varying the number of planted motif instances form 20 to 11. Obviously, PairMotif+ has a better prediction accuracy than other algorithms; meanwhile, it shows a slow downward trend with decreasing the number of planted motif instances.</p>
</sec>
<sec>
<title>Test on Biological Data</title>
<p>For real biological data, the nucleotide composition of the sequences may be biased, so we use relative entropy to measure each candidate motif
<italic>y</italic>
:</p>
<disp-formula id="E15">
<inline-graphic xlink:href="ijbsv09p0412g15.jpg" mimetype="image"></inline-graphic>
<label>(15)</label>
</disp-formula>
<p>where
<italic>f
<sub>rj</sub>
</italic>
is the frequency of character
<italic>r</italic>
in position
<italic>j</italic>
in the occurrences of
<italic>y</italic>
and
<italic>b
<sub>r</sub>
</italic>
is the background frequency of character
<italic>r</italic>
. Relative entropy measures the difference between the motif nucleotide frequency and the background nucleotide frequency.</p>
<p>
<bold>At first</bold>
, PairMotif+ is tested on the widely used real data sets, including DHFR, c-fos, preproinsulin, metallothionein and Yeast ECB
<xref ref-type="bibr" rid="B1">1</xref>
, LexA
<xref ref-type="bibr" rid="B32">32</xref>
and E.coli CRP
<xref ref-type="bibr" rid="B33">33</xref>
. Each of these data sets corresponds to a specific (
<italic>l</italic>
,
<italic>d</italic>
) problem, because each sequence contains a motif instance and all motif instances of a motif have the same length. The purpose of testing on these data sets is to check whether the proposed algorithm can find known TFBSs using the specific (
<italic>l</italic>
,
<italic>d</italic>
), where
<italic>l</italic>
is the length of the published motif and
<italic>d</italic>
is set to make 2
<italic>d</italic>
equal the maximum Hamming distance between different motif instances (binding sites).</p>
<p>Table
<xref ref-type="table" rid="T4">4</xref>
gives the used parameters and the predicted motifs. The threshold
<italic>k</italic>
is set with respect to
<italic>l</italic>
, consistent with the values in Table
<xref ref-type="table" rid="T3">3</xref>
. For the filtering strength
<italic>q</italic>
, besides the value of
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
, it is also determined by the number of input sequences. To avoid filtering out all pairs of motif instances, we set
<italic>q</italic>
as 0 when the number of input sequences is small (≤ 6). The underlined part of each predicted motif represents the part overlapped with the published motif. We can see that PairMotif+ works well for all of these data sets. Particularly, for the data sets c-fos, metallothionein, Yeast ECB and E.COLI CRP, PairMotif+ achieves accurate predictions. Also, Fig.
<xref ref-type="fig" rid="F6">6</xref>
shows sequence logos
<xref ref-type="bibr" rid="B34">34</xref>
of the predicted motifs, which graphically shows the degree of motif conservation measured by relative entropy. Note that, many existing recognition algorithms
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B3">3</xref>
-
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
also test their validity on these data sets. Since all of these algorithms (including PairMotif+) show a good performance on these data sets, here we do not make comparisons.</p>
<p>
<bold>Moreover</bold>
, we give the prediction performance of PairMotif+ on Tompa data
<xref ref-type="bibr" rid="B30">30</xref>
, which provides a group of standard data sets to evaluate the newly designed algorithms. In three types of Tompa data, we choose the data of real type, including 52 data sets obtained from the TRANSFAC database and involving four species: human (hm), mouse (mus), Drosophila melanogaster (dm) and Saccharomyces cerevisiae (yst). Furthermore, we select 31 out of the 52 data sets: we do not consider the data sets that only contain one or two sequences because PairMotif+ requires at least three input sequences; we only select the hm data sets of length 500, since the length of most hm data sets is so long that it is difficult to make effective predictions. Specifically, the selected hm data sets are hm06r, hm08r, hm10r, hm17r, hm19r, hm22r, hm23r and hm24r; the selected mus data sets are mus01r, mus02r, mus03r, mus04r, mus05r, mus06r, mus07r, mus08r, mus10r, mus11r and mus12r; the selected dm data sets are dm01r, dm03r, dm04r and dm05r; the selected yst data sets are yst01r, yst02r, yst03r, yst04r, yst05r, yst06r, yst08r and yst09r.</p>
<p>We obtain the predicted motifs and calculate the prediction accuracy (
<italic>nPC</italic>
) as follows. For each data set, since the motif length is not known in advance, we obtain eight predicted motifs, with each one having a different length ranging from 9 to 16. For each predicted motif of length
<italic>l</italic>
, it is obtained by running PairMotif+ on the most degenerate (
<italic>l</italic>
,
<italic>d</italic>
) instance with the
<italic>p</italic>
<sub>2
<italic>d</italic>
</sub>
value less than 0.7. The threshold
<italic>k</italic>
is set to 2, 3, 4 and 5 when the motif length is 9 and 10, 11 and 12, 13 and 14, and 15 and 16, respectively. The filtering strength
<italic>q</italic>
is set to 0. In the eight predicted motifs, the one most close to TFBSs is selected to calculate the prediction accuracy. To better show the results, we take MEME as a reference algorithm and calculate its prediction accuracy in the same way. The reason why we choose MEME is: MEME is a mature and widely used tool and is able to report multiple motifs quickly under a given length range, whereas few of the other algorithms can do so.</p>
<p>Fig.
<xref ref-type="fig" rid="F7">7</xref>
gives the comprehensive performance of PairMotif+ and MEME on each species of Tompa data by showing two values. One is the valid prediction rate, namely the ratio of
<italic>N
<sub>valid</sub>
</italic>
to
<italic>N
<sub>all</sub>
</italic>
, where
<italic>N
<sub>valid</sub>
</italic>
denotes the number of data sets on which the prediction accuracy is nonzero, and
<italic>N
<sub>all</sub>
</italic>
denotes the number of all data sets. The valid prediction rate indicates the adaptability of an algorithm on a specific species. The other is the average of prediction accuracy on all data sets, which represents the prediction ability of an algorithm on a specific species. PairMotif+ is comparable with MEME for both the valid prediction rate and the average of prediction accuracy. Particularly, the valid prediction rate of PairMotif+ is better than that of MEME on the dm and yst species; for the average of prediction accuracy, PairMotif+ outperforms MEME on all species except for dm.</p>
<p>More detailed results on Tompa data are shown in Fig.
<xref ref-type="fig" rid="F8">8</xref>
by plotting the prediction accuracy of PairMotif+ and MEME on each data set. We can find that the prediction accuracy of PairMotif+ is better than that of MEME on some data sets (i.e., hm17r, hm22r, etc.), but worse than on the other data sets (i.e., hm06r, hm08r, etc.). For this phenomenon, there exists realistic meaning for identifying TFBSs. The predicted motifs of different algorithms need to be complemented with each other, since motif discovery algorithms show a poor ability to identify TFBSs in higher eukaryotes
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
. Combining the results of different algorithms is conducive to improving the prediction accuracy and the related research corresponds to ensemble algorithms
<xref ref-type="bibr" rid="B31">31</xref>
. From the perspective of ensemble research, PairMotif+ provides a good candidate for the selection of fundamental algorithms.</p>
</sec>
</sec>
<sec>
<title>Discussion and Conclusions</title>
<p>Buhler and Tompa introduced a formal description of the motif search problem in 2002, the planted (
<italic>l</italic>
,
<italic>d</italic>
) motif search (PMS)
<xref ref-type="bibr" rid="B1">1</xref>
. In the next year, Evans et al. proved the NP-hardness of the PMS problem by analyzing the complexity of finding common approximate substrings
<xref ref-type="bibr" rid="B2">2</xref>
. The initial version of PMS assumed that there are exactly
<italic>d</italic>
different positions between a motif and a motif instance. To more effectively predict motifs in real biological data, in recent years researches (including us) have begun to focus on an improved version where a motif instance differs from the associated motif in at most
<italic>d</italic>
positions.</p>
<p>Numerous algorithms, either exact or approximate, have been proposed to identify (
<italic>l</italic>
,
<italic>d</italic>
) motifs. The principle of the exact algorithms is to report all (
<italic>l</italic>
,
<italic>d</italic>
) motifs and the optimal one using as little time as possible. In our previous work, we proposed an exact algorithm named PairMotif
<xref ref-type="bibr" rid="B29">29</xref>
. PairMotif is able to quickly solve many PMS instances except for the challenging ones with large
<italic>l</italic>
, such as (21, 8) and (23, 9). To the best of our knowledge, PMS5
<xref ref-type="bibr" rid="B27">27</xref>
is the fastest exact algorithm for solving challenging instances with large
<italic>l</italic>
, but its time overhead is still far from satisfactory. The aim of most approximate recognition algorithms is to get as good results as possible in a short time, such as MEME
<xref ref-type="bibr" rid="B6">6</xref>
, which always returns results within several seconds. In solving some PMS instances, such as (15, 4) and (18, 6), MEME achieves a good prediction accuracy. However, MEME, as well as many other approximate algorithms, shows a poor ability to identify highly degenerate motifs.</p>
<p>In the present study, we aim to make a good trade-off between prediction accuracy and time performance for motif search. Specifically, our goal is to use a reasonable time (within an hour on personal computers) to obtain results with high accuracy. This goal is achieved by designing a new algorithm called PairMotif+ using the strategy of PairMotif: extract some pairs of
<italic>l</italic>
-mers from input sequences
<italic>S</italic>
, and then refine each of them. Unlike PairMotif, PairMotif+ completes these tasks using the method based on probabilistic analysis rather than the exhaustive search.</p>
<p>From the theoretical perspective, PairMotif+ guarantees both a good time performance (efficiency) and a good prediction accuracy (validity). To get good time performance, we extract pairs of
<italic>l</italic>
-mers from
<italic>S</italic>
with the restriction of the threshold
<italic>k</italic>
and further filter them using the filtering strength
<italic>q</italic>
. After these operations, the number of pairs to be refined by PairMotif+ is about
<italic>O</italic>
(
<italic>n</italic>
), far less than the number of pairs processed by PariMotif,
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
). Moreover, in refining pairs of
<italic>l</italic>
-mers, unlike PairMotif that verifies all possible candidate motifs, PairMotif+ adopts an approximate refinement strategy and avoids the verification of most candidate motifs.</p>
<p>To achieve good prediction accuracy, first, the pairs of
<italic>l</italic>
-mers to be refined should contain at least one pair of motif instances, and the key point is to set the parameters
<italic>k</italic>
and
<italic>q</italic>
according to probabilistic analysis and statistical method; second, in refining pairs of
<italic>l</italic>
-mers, the generated candidate motifs should contain the desired motif, and the key point is to determine which part of subsets in the partition of candidate motifs should be generated in terms of probabilistic analysis. The foundation required by all of these work is the distance relation between a motif
<italic>m</italic>
and its instance
<italic>m</italic>
'. The basic distance relation that
<italic>m</italic>
and
<italic>m</italic>
' differ by at most
<italic>d</italic>
positions is not specific enough to carry out probabilistic analysis. Therefore, based on the basic relation, we adopt a more specific version, namely the expectation of the distance between
<italic>m</italic>
and
<italic>m</italic>
' is 3
<italic>d</italic>
/4, which allows us to quantitatively analyze how to set appropriate parameters. The choice of this expectation is reasonable: if the expectation is too small, then our algorithm can only identify the highly conserved motifs and lacks a good scalability; if the expectation is
<italic>d</italic>
, it is not consistent with the practical biological case and will also decrease the time performance of our algorithm.</p>
<p>Experimental results also demonstrate the efficiency and validity of PairMotif+. From the results on simulated data, we can find: (1) PairMotif+ is able to solve various PMS instances within an hour on personal computers. Particularly, all instances except for (21, 8) and (23, 9) are solved within several seconds to several minutes. (2) The prediction accuracy of PairMotif+ is better than that of the compared algorithms and close to the optimal solution. (3) PairMotif+ shows a stable prediction as the sequence length is increased. (4) It is easy to extend PairMotif+ to solve the motif search problem that is not in the case of OOPS (one motif occurrence per sequence), and the prediction accuracy is stable over different number of planted motif instances. Moreover, for the experiments on real biological data, we use two groups of data sets: (1) The first group includes DHFR, c-fos, preproinsulin, metallothionein and Yeast ECB
<xref ref-type="bibr" rid="B1">1</xref>
, LexA
<xref ref-type="bibr" rid="B32">32</xref>
and E.coli CRP
<xref ref-type="bibr" rid="B33">33</xref>
, which are used by many existing recognition algorithms to test their validity. For each of these data sets, PairMotif+ is able to find all or a large part of TFBSs. (2) The second group of data sets is the Tompa data
<xref ref-type="bibr" rid="B30">30</xref>
, the standard data sets to evaluate the newly designed recognition algorithms. The comprehensive performance of PairMotif+ is comparable with that of the mature and popular algorithm MEME.</p>
<p>In summary, we have proposed a new approximate algorithm for the PMS problem and tested it on both simulated data and real biological data. This algorithm is good at identifying highly degenerate motifs, and outperforms the compared algorithms in identification accuracy. Although the execution time increases with the increase of the motif length, which is determined by the used pattern-driven framework, the proposed algorithm is able to solve various PMS instances within an hour on personal computers.</p>
</sec>
</body>
<back>
<ack>
<p>This work was supported by the National Natural Science Foundation of China (61173025), the Research Fund for the Doctoral Program of Higher Education of China (20100203110010), and the Fundamental Research Funds for the Central Universities (K50513100011 and K5051303002).</p>
</ack>
<ref-list>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<name>
<surname>Buhler</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<article-title>Finding motifs using random projections</article-title>
<source>J Comput Biol</source>
<year>2002</year>
<volume>9</volume>
<fpage>225</fpage>
<lpage>42</lpage>
<pub-id pub-id-type="pmid">12015879</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<name>
<surname>Evans</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Wareham</surname>
<given-names>HT</given-names>
</name>
<article-title>On the complexity of finding common approximate substrings</article-title>
<source>Theor Comput Sci</source>
<year>2003</year>
<volume>306</volume>
<fpage>407</fpage>
<lpage>30</lpage>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<name>
<surname>Ho</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Jakubowski</surname>
<given-names>CD</given-names>
</name>
<name>
<surname>Gunderson</surname>
<given-names>SI</given-names>
</name>
<article-title>iTriplet, a rule-based nucleic acid sequence motif finder</article-title>
<source>Algorithm Mol Biol</source>
<year>2009</year>
<volume>4</volume>
<fpage>14</fpage>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<name>
<surname>Sun</surname>
<given-names>HQ</given-names>
</name>
<name>
<surname>Low</surname>
<given-names>MYH</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>WJ</given-names>
</name>
<etal></etal>
<article-title>RecMotif: a novel fast algorithm for weak motif discovery</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 11</issue>
<fpage>S8</fpage>
<pub-id pub-id-type="pmid">21172058</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<name>
<surname>Davila</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<article-title>Fast and practical algorithms for planted (
<italic>l</italic>
,
<italic>d</italic>
) motif search</article-title>
<source>IEEE ACM T Comput Bi</source>
<year>2007</year>
<volume>4</volume>
<fpage>544</fpage>
<lpage>52</lpage>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Misleh</surname>
<given-names>C</given-names>
</name>
<etal></etal>
<article-title>MEME: discovering and analyzing DNA and protein sequence motifs</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<fpage>369</fpage>
<lpage>73</lpage>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<name>
<surname>Lawrence</surname>
<given-names>CE</given-names>
</name>
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Boguski</surname>
<given-names>MS</given-names>
</name>
<etal></etal>
<article-title>Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment</article-title>
<source>Science</source>
<year>1993</year>
<volume>262</volume>
<fpage>208</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="pmid">8211139</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<article-title>GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery</article-title>
<source>J Comput Biol</source>
<year>2009</year>
<volume>16</volume>
<fpage>317</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="pmid">19193149</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<name>
<surname>Neuwald</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Lawrence</surname>
<given-names>CE</given-names>
</name>
<article-title>Gibbs motif sampling: detection of bacterial outer membrane protein repeats</article-title>
<source>Protein Sci</source>
<year>1995</year>
<volume>4</volume>
<fpage>1618</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="pmid">8520488</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<name>
<surname>Hughes</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Estep</surname>
<given-names>PW</given-names>
</name>
<name>
<surname>Tavazoie</surname>
<given-names>S</given-names>
</name>
<etal></etal>
<article-title>Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae</article-title>
<source>J Mol Biol</source>
<year>2000</year>
<volume>296</volume>
<fpage>1205</fpage>
<lpage>14</lpage>
<pub-id pub-id-type="pmid">10698627</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<name>
<surname>Thijs</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Lescot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Marchal</surname>
<given-names>K</given-names>
</name>
<etal></etal>
<article-title>A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>1113</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">11751219</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<name>
<surname>Tang</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Krogh</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Winther</surname>
<given-names>O</given-names>
</name>
<article-title>BayesMD: flexible biological modeling for motif discovery</article-title>
<source>J Comput Biol</source>
<year>2008</year>
<volume>15</volume>
<fpage>1347</fpage>
<lpage>63</lpage>
<pub-id pub-id-type="pmid">19040368</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<name>
<surname>Kim</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Tharakaraman</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Mariño-Ramírez</surname>
<given-names>L</given-names>
</name>
<etal></etal>
<article-title>Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>262</fpage>
<pub-id pub-id-type="pmid">18533028</pub-id>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>AK</given-names>
</name>
<name>
<surname>Print</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>PMF</given-names>
</name>
<etal></etal>
<article-title>A Bayesian search for transcriptional motifs</article-title>
<source>PLoS One</source>
<year>2010</year>
<volume>5</volume>
<fpage>e13897</fpage>
<pub-id pub-id-type="pmid">21124986</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="journal">
<name>
<surname>Jajamovich</surname>
<given-names>GH</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>XD</given-names>
</name>
<name>
<surname>Arkin</surname>
<given-names>AP</given-names>
</name>
<etal></etal>
<article-title>Bayesian multiple-instance motif discovery with BAMBI: inference of recombinase and transcription factor binding sites</article-title>
<source>Nucleic Acids Res</source>
<year>2011</year>
<volume>39</volume>
<fpage>e146</fpage>
<pub-id pub-id-type="pmid">21948794</pub-id>
</element-citation>
</ref>
<ref id="B16">
<label>16</label>
<element-citation publication-type="journal">
<name>
<surname>Fratkin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Naughton</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
<etal></etal>
<article-title>MotifCut: regulatory motifs finding with maximum density subgraphs</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>e150</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">16873465</pub-id>
</element-citation>
</ref>
<ref id="B17">
<label>17</label>
<element-citation publication-type="journal">
<name>
<surname>Boucher</surname>
<given-names>C</given-names>
</name>
<name>
<surname>King</surname>
<given-names>J</given-names>
</name>
<article-title>Fast motif recognition via application of statistical thresholds</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 1</issue>
<fpage>S11</fpage>
<pub-id pub-id-type="pmid">20122182</pub-id>
</element-citation>
</ref>
<ref id="B18">
<label>18</label>
<element-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>SY</given-names>
</name>
<article-title>An improved heuristic algorithm for finding motif signals in DNA sequences</article-title>
<source>IEEE ACM T Comput Bi</source>
<year>2011</year>
<volume>8</volume>
<fpage>959</fpage>
<lpage>75</lpage>
</element-citation>
</ref>
<ref id="B19">
<label>19</label>
<element-citation publication-type="book">
<name>
<surname>Jones</surname>
<given-names>NC</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<source>An introduction to bioinformatics algorithms</source>
<publisher-loc>Cambridge</publisher-loc>
<publisher-name>MIT Press</publisher-name>
<year>2004</year>
</element-citation>
</ref>
<ref id="B20">
<label>20</label>
<element-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Rajapakse</surname>
<given-names>JC</given-names>
</name>
<article-title>Graphical approach to weak motif recognition</article-title>
<source>Genome Inform</source>
<year>2004</year>
<volume>15</volume>
<fpage>52</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="pmid">15706491</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21</label>
<element-citation publication-type="journal">
<name>
<surname>Vanet</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Marsan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Labigne</surname>
<given-names>A</given-names>
</name>
<etal></etal>
<article-title>Inferring regulatory elements from a whole genome: an analysis of helicobacter pylori σ
<sup>80</sup>
family of promoter signals</article-title>
<source>J Mol Biol</source>
<year>2000</year>
<volume>297</volume>
<fpage>335</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="pmid">10715205</pub-id>
</element-citation>
</ref>
<ref id="B22">
<label>22</label>
<element-citation publication-type="journal">
<name>
<surname>Marsan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Sagot</surname>
<given-names>M</given-names>
</name>
<article-title>Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification</article-title>
<source>J Comput Biol</source>
<year>2000</year>
<volume>7</volume>
<fpage>345</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="pmid">11108467</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23</label>
<element-citation publication-type="journal">
<name>
<surname>Pavesi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mauri</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pesole</surname>
<given-names>G</given-names>
</name>
<article-title>An algorithm for finding signals of unknown length in DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>207</fpage>
<lpage>14</lpage>
</element-citation>
</ref>
<ref id="B24">
<label>24</label>
<element-citation publication-type="journal">
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<article-title>Finding composite regulatory patterns in DNA sequences</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>354</fpage>
<lpage>63</lpage>
</element-citation>
</ref>
<ref id="B25">
<label>25</label>
<element-citation publication-type="journal">
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>CH</given-names>
</name>
<article-title>Exact algorithms for planted motif problems</article-title>
<source>J Comput Biol</source>
<year>2005</year>
<volume>12</volume>
<fpage>1117</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="pmid">16241901</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26</label>
<element-citation publication-type="journal">
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>CH</given-names>
</name>
<etal></etal>
<article-title>High-Performance exact algorithms for motif search</article-title>
<source>J Clin Monit Comput</source>
<year>2005</year>
<volume>19</volume>
<fpage>319</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="pmid">16328946</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27</label>
<element-citation publication-type="journal">
<name>
<surname>Dinh</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kundeti</surname>
<given-names>VK</given-names>
</name>
<article-title>PMS5: an efficient exact algorithm for the (
<italic>l</italic>
,
<italic>d</italic>
)-motif finding problem</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<fpage>410</fpage>
<pub-id pub-id-type="pmid">22024209</pub-id>
</element-citation>
</ref>
<ref id="B28">
<label>28</label>
<element-citation publication-type="journal">
<name>
<surname>Dinh</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Davila</surname>
<given-names>J</given-names>
</name>
<article-title>qPMS7: a fast algorithm for finding (
<italic>l</italic>
,
<italic>d</italic>
)-motifs in DNA and protein sequences</article-title>
<source>PLoS One</source>
<year>2012</year>
<volume>7</volume>
<fpage>e41425</fpage>
<pub-id pub-id-type="pmid">22848493</pub-id>
</element-citation>
</ref>
<ref id="B29">
<label>29</label>
<element-citation publication-type="journal">
<name>
<surname>Yu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>YP</given-names>
</name>
<etal></etal>
<article-title>PairMotif: a new pattern-driven algorithm for planted (
<italic>l</italic>
,
<italic>d</italic>
) DNA motif search</article-title>
<source>PLoS One</source>
<year>2012</year>
<volume>7</volume>
<fpage>e48442</fpage>
<pub-id pub-id-type="pmid">23119020</pub-id>
</element-citation>
</ref>
<ref id="B30">
<label>30</label>
<element-citation publication-type="journal">
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<etal></etal>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites</article-title>
<source>Nat Biotechnol</source>
<year>2005</year>
<volume>23</volume>
<fpage>137</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="pmid">15637633</pub-id>
</element-citation>
</ref>
<ref id="B31">
<label>31</label>
<element-citation publication-type="journal">
<name>
<surname>Hu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Kihara</surname>
<given-names>D</given-names>
</name>
<article-title>Limitations and potentials of current motif discovery algorithms</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<fpage>4899</fpage>
<lpage>913</lpage>
<pub-id pub-id-type="pmid">16284194</pub-id>
</element-citation>
</ref>
<ref id="B32">
<label>32</label>
<element-citation publication-type="journal">
<name>
<surname>Hertz</surname>
<given-names>GZ</given-names>
</name>
<name>
<surname>Hartzell</surname>
<given-names>GW</given-names>
</name>
<article-title>Identification of consensus patterns in unaligned DNA sequences known to be functionally related</article-title>
<source>Comput Appl Biosci</source>
<year>1990</year>
<volume>6</volume>
<fpage>81</fpage>
<lpage>92</lpage>
<pub-id pub-id-type="pmid">2193692</pub-id>
</element-citation>
</ref>
<ref id="B33">
<label>33</label>
<element-citation publication-type="journal">
<name>
<surname>Stormo</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Hartzell</surname>
<given-names>GW</given-names>
</name>
<article-title>Identifying protein-binding sites from unaligned DNA fragments</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>1989</year>
<volume>86</volume>
<fpage>1183</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">2919167</pub-id>
</element-citation>
</ref>
<ref id="B34">
<label>34</label>
<element-citation publication-type="journal">
<name>
<surname>Crooks</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chandonia</surname>
<given-names>JM</given-names>
</name>
<etal></etal>
<article-title>WebLogo: a sequence Logo generator</article-title>
<source>Genome Res</source>
<year>2004</year>
<volume>14</volume>
<fpage>1188</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="pmid">15173120</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="F1" position="float">
<label>Fig 1</label>
<caption>
<p>An example for partitioning positions in the alignment of three l-mers.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g01"></graphic>
</fig>
<fig id="F2" position="float">
<label>Fig 2</label>
<caption>
<p>Weight distribution of pairs of l-mers and pairs of motif instances in L
<sub>1</sub>
. (A) l = 15 and d = 4. (B) l = 18 and d = 6.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g11"></graphic>
</fig>
<fig id="F3" position="float">
<label>Fig 3</label>
<caption>
<p>Comparisons on challenging PMS instances.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g12"></graphic>
</fig>
<fig id="F4" position="float">
<label>Fig 4</label>
<caption>
<p>Comparisons on different sequence length.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g13"></graphic>
</fig>
<fig id="F5" position="float">
<label>Fig 5</label>
<caption>
<p>Comparisons on different number of planted motif instances.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g14"></graphic>
</fig>
<fig id="F6" position="float">
<label>Fig 6</label>
<caption>
<p>Sequence logos of predicted motifs.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g16"></graphic>
</fig>
<fig id="F7" position="float">
<label>Fig 7</label>
<caption>
<p>Comprehensive performance on each species of Tompa data.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g17"></graphic>
</fig>
<fig id="F8" position="float">
<label>Fig 8</label>
<caption>
<p>Detailed prediction accuracy on Tompa data.</p>
</caption>
<graphic xlink:href="ijbsv09p0412g18"></graphic>
</fig>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>
<italic>p
<sub>k</sub>
</italic>
,
<italic> p</italic>
'
<italic>
<sub>k</sub>
</italic>
and
<italic>E</italic>
[
<italic>N
<sub>m</sub>
</italic>
] under different values of
<italic>k</italic>
for the PMS instance (15, 4).</p>
</caption>
<table frame="hsides" rules="groups">
<thead valign="top">
<tr>
<th rowspan="1" colspan="1">k</th>
<th rowspan="1" colspan="1">4</th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">6</th>
<th rowspan="1" colspan="1">7</th>
<th rowspan="1" colspan="1">8</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="1" colspan="1">p
<sub>k</sub>
</td>
<td rowspan="1" colspan="1">0.0001</td>
<td rowspan="1" colspan="1">0.0008</td>
<td rowspan="1" colspan="1">0.0042</td>
<td rowspan="1" colspan="1">0.0173</td>
<td rowspan="1" colspan="1">0.0570</td>
</tr>
<tr>
<td rowspan="1" colspan="1">p'
<sub>k</sub>
</td>
<td rowspan="1" colspan="1">0.3461</td>
<td rowspan="1" colspan="1">0.5845</td>
<td rowspan="1" colspan="1">0.8469</td>
<td rowspan="1" colspan="1">0.9242</td>
<td rowspan="1" colspan="1">1.0000</td>
</tr>
<tr>
<td rowspan="1" colspan="1">E[N
<sub>m</sub>
]</td>
<td rowspan="1" colspan="1">65.759</td>
<td rowspan="1" colspan="1">111.06</td>
<td rowspan="1" colspan="1">160.91</td>
<td rowspan="1" colspan="1">175.60</td>
<td rowspan="1" colspan="1">190.00</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Two values related to
<italic>d
<sub>sum</sub>
</italic>
under the instance (15, 4) and
<italic>d
<sub>H</sub>
</italic>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) = 4.</p>
</caption>
<table frame="hsides" rules="groups">
<thead valign="top">
<tr>
<th rowspan="1" colspan="1">d
<sub>sum</sub>
</th>
<th rowspan="1" colspan="1">Associated subsets of R(x
<sub>1</sub>
, x
<sub>2</sub>
)</th>
<th rowspan="1" colspan="1">Occurrence probability of d
<sub>sum</sub>
</th>
<th rowspan="1" colspan="1">Number of candidate motifs (Percentage)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">{<0,4>, <1,2>, <2,0>}</td>
<td rowspan="1" colspan="1">0.11</td>
<td rowspan="1" colspan="1">4570 (66.7%)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">{<0,3>, <1,1>}</td>
<td rowspan="1" colspan="1">0.19</td>
<td rowspan="1" colspan="1">1648 (24.0%)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">{<0,2>, <1,0>}</td>
<td rowspan="1" colspan="1">0.36</td>
<td rowspan="1" colspan="1">558 (8.1%)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">{<0,1>}</td>
<td rowspan="1" colspan="1">0.24</td>
<td rowspan="1" colspan="1">64 (0.9%)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">{<0,0>}</td>
<td rowspan="1" colspan="1">0.10</td>
<td rowspan="1" colspan="1">14 (0.2%)</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Comparisons on PMS instances with different 2
<italic>d</italic>
-neighborhood probability.</p>
</caption>
<table frame="hsides" rules="groups">
<thead valign="top">
<tr>
<th rowspan="1" colspan="1">(l, d)</th>
<th rowspan="1" colspan="1">p
<sub>2d</sub>
</th>
<th rowspan="1" colspan="1">k</th>
<th rowspan="1" colspan="1">q</th>
<th rowspan="1" colspan="1">PairMotif+</th>
<th rowspan="1" colspan="1">MEME</th>
<th rowspan="1" colspan="1">AlignACE</th>
<th rowspan="1" colspan="1">VINE</th>
<th rowspan="1" colspan="1">PairMotif</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="1" colspan="1">(15, 4)</td>
<td rowspan="1" colspan="1">0.057</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">1.00 (2s)</td>
<td rowspan="1" colspan="1">0.93 (6s)</td>
<td rowspan="1" colspan="1">0.64 (4.3m)</td>
<td rowspan="1" colspan="1">0.98 (7.1m)</td>
<td rowspan="1" colspan="1">1.00 (2s)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(14, 4)</td>
<td rowspan="1" colspan="1">0.112</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">0.94 (2s)</td>
<td rowspan="1" colspan="1">0.77 (6s)</td>
<td rowspan="1" colspan="1">0.59 (2.7m)</td>
<td rowspan="1" colspan="1">0.91 (8.4m)</td>
<td rowspan="1" colspan="1">0.96 (14s)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(25, 8)</td>
<td rowspan="1" colspan="1">0.149</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">1.00 (2.4m)</td>
<td rowspan="1" colspan="1">1.00 (6s)</td>
<td rowspan="1" colspan="1">0.97 (2.5m)</td>
<td rowspan="1" colspan="1">1.00 (9.6m)</td>
<td rowspan="1" colspan="1">1.00 (52.3m)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(24, 8)</td>
<td rowspan="1" colspan="1">0.234</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">1.00 (3.0m)</td>
<td rowspan="1" colspan="1">0.98 (6s)</td>
<td rowspan="1" colspan="1">0.86 (2.2m)</td>
<td rowspan="1" colspan="1">0.98(12.2m)</td>
<td rowspan="1" colspan="1">> 5h</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(18, 6)</td>
<td rowspan="1" colspan="1">0.283</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">1.00 (14s)</td>
<td rowspan="1" colspan="1">0.89 (6s)</td>
<td rowspan="1" colspan="1">0.51 (2.1m)</td>
<td rowspan="1" colspan="1">1.00 (9.3m)</td>
<td rowspan="1" colspan="1">1.00 (12.1m)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(15, 5)</td>
<td rowspan="1" colspan="1">0.319</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0.95 (3s)</td>
<td rowspan="1" colspan="1">0.76 (6s)</td>
<td rowspan="1" colspan="1">0.43 (2.2m)</td>
<td rowspan="1" colspan="1">0.70 (8.7m)</td>
<td rowspan="1" colspan="1">0.95 (4.7m)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(17, 6)</td>
<td rowspan="1" colspan="1">0.426</td>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0.90 (26s)</td>
<td rowspan="1" colspan="1">0.66 (6s)</td>
<td rowspan="1" colspan="1">0.40 (3.6m)</td>
<td rowspan="1" colspan="1">0.80 (9.5m)</td>
<td rowspan="1" colspan="1">0.93 (53.3m)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(19, 7)</td>
<td rowspan="1" colspan="1">0.534</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0.96 (58s)</td>
<td rowspan="1" colspan="1">0.56 (6s)</td>
<td rowspan="1" colspan="1">0.42 (3.2m)</td>
<td rowspan="1" colspan="1">0.76 (10.1m)</td>
<td rowspan="1" colspan="1">> 5h</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(21, 8)</td>
<td rowspan="1" colspan="1">0.633</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0.94 (18.1m)</td>
<td rowspan="1" colspan="1">0.68 (6s)</td>
<td rowspan="1" colspan="1">0.48 (3.2m)</td>
<td rowspan="1" colspan="1">0.88 (13.4m)</td>
<td rowspan="1" colspan="1">> 5h</td>
</tr>
<tr>
<td rowspan="1" colspan="1">(23, 9)</td>
<td rowspan="1" colspan="1">0.698</td>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">0.98 (47.9m)</td>
<td rowspan="1" colspan="1">0.76 (6s)</td>
<td rowspan="1" colspan="1">0.53 (3.5m)</td>
<td rowspan="1" colspan="1">0.85 (15.2m)</td>
<td rowspan="1" colspan="1">> 5h</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Time units, s: seconds; m: minutes; h: hours.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T4" position="float">
<label>Table 4</label>
<caption>
<p>Results on several widely used real data sets.</p>
</caption>
<table frame="hsides" rules="groups">
<thead valign="top">
<tr>
<th rowspan="1" colspan="1">Data (# of sequences)</th>
<th rowspan="1" colspan="1">(
<italic>l, d</italic>
)</th>
<th rowspan="1" colspan="1">
<italic>K</italic>
</th>
<th rowspan="1" colspan="1">
<italic>q</italic>
</th>
<th rowspan="1" colspan="1">Predicted motifs</th>
<th rowspan="1" colspan="1">Published motifs</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="1" colspan="1">DHFR (4)</td>
<td rowspan="1" colspan="1">(11, 3)</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<underline>GCGCCA</underline>
AACTT</td>
<td rowspan="1" colspan="1">ATTTCGCGCCA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">c-fos (6)</td>
<td rowspan="1" colspan="1">(16, 4)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<underline>CCATTTTAGGACATCT</underline>
</td>
<td rowspan="1" colspan="1">CCATATTAGGACATCT</td>
</tr>
<tr>
<td rowspan="1" colspan="1">preproinsulin (4)</td>
<td rowspan="1" colspan="1">(15, 4)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">TG
<underline>CAACCTCAGCCCC</underline>
</td>
<td rowspan="1" colspan="1">CAGCCTCAGCCCCCA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">metallothionein (4)</td>
<td rowspan="1" colspan="1">(15, 4)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<underline>CTCTGCACCCGGCCC</underline>
</td>
<td rowspan="1" colspan="1">CTCTGCACRCCGCCC</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Yeast ECB (5)</td>
<td rowspan="1" colspan="1">(16, 5)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">
<underline>TTACCCAGTAAGGAAA</underline>
</td>
<td rowspan="1" colspan="1">TTTCCCNNTNAGGAAA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">LexA (16)</td>
<td rowspan="1" colspan="1">(20, 7)</td>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">A
<underline>TACTGTATATGCATTCAAC</underline>
</td>
<td rowspan="1" colspan="1">TACTGTATATATATACAGTA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">E.coli CRP (18)</td>
<td rowspan="1" colspan="1">(16, 7)</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<underline>TGTGAACGAGTTCACA</underline>
</td>
<td rowspan="1" colspan="1">TGTGANNNNGNTCACA</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D340 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000D340 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021