Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

Identifieur interne : 000B32 ( Pmc/Corpus ); précédent : 000B31; suivant : 000B33

PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

Auteurs : Qiang Yu ; Hongwei Huo ; Dazheng Feng

Source :

RBID : PMC:5098105

Abstract

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.


Url:
DOI: 10.1155/2016/4986707
PubMed: 27843946
PubMed Central: 5098105

Links to Exploration step

PMC:5098105

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets</title>
<author>
<name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
<affiliation>
<nlm:aff id="I1">School of Computer Science and Technology, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
<affiliation>
<nlm:aff id="I1">School of Computer Science and Technology, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Feng, Dazheng" sort="Feng, Dazheng" uniqKey="Feng D" first="Dazheng" last="Feng">Dazheng Feng</name>
<affiliation>
<nlm:aff id="I2">School of Electronic Engineering, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27843946</idno>
<idno type="pmc">5098105</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5098105</idno>
<idno type="RBID">PMC:5098105</idno>
<idno type="doi">10.1155/2016/4986707</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000B32</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B32</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets</title>
<author>
<name sortKey="Yu, Qiang" sort="Yu, Qiang" uniqKey="Yu Q" first="Qiang" last="Yu">Qiang Yu</name>
<affiliation>
<nlm:aff id="I1">School of Computer Science and Technology, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huo, Hongwei" sort="Huo, Hongwei" uniqKey="Huo H" first="Hongwei" last="Huo">Hongwei Huo</name>
<affiliation>
<nlm:aff id="I1">School of Computer Science and Technology, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Feng, Dazheng" sort="Feng, Dazheng" uniqKey="Feng D" first="Dazheng" last="Feng">Dazheng Feng</name>
<affiliation>
<nlm:aff id="I2">School of Electronic Engineering, Xidian University, Xi'an 710071, China</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BioMed Research International</title>
<idno type="ISSN">2314-6133</idno>
<idno type="eISSN">2314-6141</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of
<italic>l</italic>
-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of
<italic>l</italic>
-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="D Haeseleer, P" uniqKey="D Haeseleer P">P. D'Haeseleer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Matys, V" uniqKey="Matys V">V. Matys</name>
</author>
<author>
<name sortKey="Fricke, E" uniqKey="Fricke E">E. Fricke</name>
</author>
<author>
<name sortKey="Geffers, R" uniqKey="Geffers R">R. Geffers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guhathakurta, D" uniqKey="Guhathakurta D">D. GuhaThakurta</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mathelier, A" uniqKey="Mathelier A">A. Mathelier</name>
</author>
<author>
<name sortKey="Shi, W" uniqKey="Shi W">W. Shi</name>
</author>
<author>
<name sortKey="Wasserman, W W" uniqKey="Wasserman W">W. W. Wasserman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, C W" uniqKey="Huang C">C.-W. Huang</name>
</author>
<author>
<name sortKey="Lee, W S" uniqKey="Lee W">W.-S. Lee</name>
</author>
<author>
<name sortKey="Hsieh, S Y" uniqKey="Hsieh S">S.-Y. Hsieh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Evans, P A" uniqKey="Evans P">P. A. Evans</name>
</author>
<author>
<name sortKey="Smith, A D" uniqKey="Smith A">A. D. Smith</name>
</author>
<author>
<name sortKey="Wareham, H T" uniqKey="Wareham H">H. T. Wareham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zambelli, F" uniqKey="Zambelli F">F. Zambelli</name>
</author>
<author>
<name sortKey="Pesole, G" uniqKey="Pesole G">G. Pesole</name>
</author>
<author>
<name sortKey="Pavesi, G" uniqKey="Pavesi G">G. Pavesi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mardis, E R" uniqKey="Mardis E">E. R. Mardis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
<author>
<name sortKey="Homer, N" uniqKey="Homer N">N. Homer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, H" uniqKey="Kim H">H. Kim</name>
</author>
<author>
<name sortKey="Kim, J" uniqKey="Kim J">J. Kim</name>
</author>
<author>
<name sortKey="Selby, H" uniqKey="Selby H">H. Selby</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pavesi, G" uniqKey="Pavesi G">G. Pavesi</name>
</author>
<author>
<name sortKey="Mereghetti, P" uniqKey="Mereghetti P">P. Mereghetti</name>
</author>
<author>
<name sortKey="Mauri, G" uniqKey="Mauri G">G. Mauri</name>
</author>
<author>
<name sortKey="Pesole, G" uniqKey="Pesole G">G. Pesole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Q" uniqKey="Yu Q">Q. Yu</name>
</author>
<author>
<name sortKey="Huo, H" uniqKey="Huo H">H. Huo</name>
</author>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y. Zhang</name>
</author>
<author>
<name sortKey="Guo, H" uniqKey="Guo H">H. Guo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Q" uniqKey="Yu Q">Q. Yu</name>
</author>
<author>
<name sortKey="Huo, H" uniqKey="Huo H">H. Huo</name>
</author>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y. Zhang</name>
</author>
<author>
<name sortKey="Guo, H" uniqKey="Guo H">H. Guo</name>
</author>
<author>
<name sortKey="Guo, H" uniqKey="Guo H">H. Guo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bailey, T L" uniqKey="Bailey T">T. L. Bailey</name>
</author>
<author>
<name sortKey="Williams, N" uniqKey="Williams N">N. Williams</name>
</author>
<author>
<name sortKey="Misleh, C" uniqKey="Misleh C">C. Misleh</name>
</author>
<author>
<name sortKey="Li, W W" uniqKey="Li W">W. W. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nicolae, M" uniqKey="Nicolae M">M. Nicolae</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S. Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nicolae, M" uniqKey="Nicolae M">M. Nicolae</name>
</author>
<author>
<name sortKey="Rajasekaran, S" uniqKey="Rajasekaran S">S. Rajasekaran</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Das, M K" uniqKey="Das M">M. K. Das</name>
</author>
<author>
<name sortKey="Dai, H K" uniqKey="Dai H">H.-K. Dai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jia, C" uniqKey="Jia C">C. Jia</name>
</author>
<author>
<name sortKey="Carson, M B" uniqKey="Carson M">M. B. Carson</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y. Wang</name>
</author>
<author>
<name sortKey="Lin, Y" uniqKey="Lin Y">Y. Lin</name>
</author>
<author>
<name sortKey="Lu, H" uniqKey="Lu H">H. Lu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zambelli, F" uniqKey="Zambelli F">F. Zambelli</name>
</author>
<author>
<name sortKey="Pavesi, G" uniqKey="Pavesi G">G. Pavesi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thomas Chollier, M" uniqKey="Thomas Chollier M">M. Thomas-Chollier</name>
</author>
<author>
<name sortKey="Herrmann, C" uniqKey="Herrmann C">C. Herrmann</name>
</author>
<author>
<name sortKey="Defrance, M" uniqKey="Defrance M">M. Defrance</name>
</author>
<author>
<name sortKey="Sand, O" uniqKey="Sand O">O. Sand</name>
</author>
<author>
<name sortKey="Thieffry, D" uniqKey="Thieffry D">D. Thieffry</name>
</author>
<author>
<name sortKey="Van Helden, J" uniqKey="Van Helden J">J. van Helden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sharov, A A" uniqKey="Sharov A">A. A. Sharov</name>
</author>
<author>
<name sortKey="Ko, M S H" uniqKey="Ko M">M. S. H. Ko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Q" uniqKey="Yu Q">Q. Yu</name>
</author>
<author>
<name sortKey="Huo, H" uniqKey="Huo H">H. Huo</name>
</author>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X. Chen</name>
</author>
<author>
<name sortKey="Guo, H" uniqKey="Guo H">H. Guo</name>
</author>
<author>
<name sortKey="Vitter, J S" uniqKey="Vitter J">J. S. Vitter</name>
</author>
<author>
<name sortKey="Huan, J" uniqKey="Huan J">J. Huan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Machanick, P" uniqKey="Machanick P">P. Machanick</name>
</author>
<author>
<name sortKey="Bailey, T L" uniqKey="Bailey T">T. L. Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Reid, J E" uniqKey="Reid J">J. E. Reid</name>
</author>
<author>
<name sortKey="Wernisch, L" uniqKey="Wernisch L">L. Wernisch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lihu, A" uniqKey="Lihu A">A. Lihu</name>
</author>
<author>
<name sortKey="Holban, " uniqKey="Holban ">Ş. Holban</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Q" uniqKey="Yu Q">Q. Yu</name>
</author>
<author>
<name sortKey="Huo, H" uniqKey="Huo H">H. Huo</name>
</author>
<author>
<name sortKey="Zhao, R" uniqKey="Zhao R">R. Zhao</name>
</author>
<author>
<name sortKey="Feng, D" uniqKey="Feng D">D. Feng</name>
</author>
<author>
<name sortKey="Vitter, J S" uniqKey="Vitter J">J. S. Vitter</name>
</author>
<author>
<name sortKey="Huan, J" uniqKey="Huan J">J. Huan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buhler, J" uniqKey="Buhler J">J. Buhler</name>
</author>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M. Tompa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S. van Dongen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pevzner, P A" uniqKey="Pevzner P">P. A. Pevzner</name>
</author>
<author>
<name sortKey="Sze, S H" uniqKey="Sze S">S. H. Sze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X. Chen</name>
</author>
<author>
<name sortKey="Xu, H" uniqKey="Xu H">H. Xu</name>
</author>
<author>
<name sortKey="Yuan, P" uniqKey="Yuan P">P. Yuan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tompa, M" uniqKey="Tompa M">M. Tompa</name>
</author>
<author>
<name sortKey="Li, N" uniqKey="Li N">N. Li</name>
</author>
<author>
<name sortKey="Bailey, T L" uniqKey="Bailey T">T. L. Bailey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Crooks, G E" uniqKey="Crooks G">G. E. Crooks</name>
</author>
<author>
<name sortKey="Hon, G" uniqKey="Hon G">G. Hon</name>
</author>
<author>
<name sortKey="Chandonia, J M" uniqKey="Chandonia J">J.-M. Chandonia</name>
</author>
<author>
<name sortKey="Brenner, S E" uniqKey="Brenner S">S. E. Brenner</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Biomed Res Int</journal-id>
<journal-id journal-id-type="iso-abbrev">Biomed Res Int</journal-id>
<journal-id journal-id-type="publisher-id">BMRI</journal-id>
<journal-title-group>
<journal-title>BioMed Research International</journal-title>
</journal-title-group>
<issn pub-type="ppub">2314-6133</issn>
<issn pub-type="epub">2314-6141</issn>
<publisher>
<publisher-name>Hindawi Publishing Corporation</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27843946</article-id>
<article-id pub-id-type="pmc">5098105</article-id>
<article-id pub-id-type="doi">10.1155/2016/4986707</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0002-8725-6687</contrib-id>
<name>
<surname>Yu</surname>
<given-names>Qiang</given-names>
</name>
<xref ref-type="aff" rid="I1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Huo</surname>
<given-names>Hongwei</given-names>
</name>
<xref ref-type="aff" rid="I1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Feng</surname>
<given-names>Dazheng</given-names>
</name>
<xref ref-type="aff" rid="I2">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="I1">
<sup>1</sup>
School of Computer Science and Technology, Xidian University, Xi'an 710071, China</aff>
<aff id="I2">
<sup>2</sup>
School of Electronic Engineering, Xidian University, Xi'an 710071, China</aff>
<author-notes>
<corresp id="cor1">*Qiang Yu:
<email>qyu@mail.xidian.edu.cn</email>
</corresp>
<fn fn-type="other">
<p>Academic Editor: Yudong Cai</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>24</day>
<month>10</month>
<year>2016</year>
</pub-date>
<volume>2016</volume>
<elocation-id>4986707</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>6</month>
<year>2016</year>
</date>
<date date-type="rev-recd">
<day>4</day>
<month>9</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>9</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2016 Qiang Yu et al.</copyright-statement>
<copyright-year>2016</copyright-year>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of
<italic>l</italic>
-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of
<italic>l</italic>
-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.</p>
</abstract>
<funding-group>
<award-group>
<funding-source xlink:href="http://dx.doi.org/10.13039/501100001809">National Natural Science Foundation of China</funding-source>
<award-id>61502366</award-id>
<award-id>61373044</award-id>
</award-group>
<award-group>
<funding-source xlink:href="http://dx.doi.org/10.13039/501100002858">China Postdoctoral Science Foundation</funding-source>
<award-id>2015M582621</award-id>
</award-group>
<award-group>
<funding-source>National Key Research and Development Program of China</funding-source>
<award-id>2016YFC0102000</award-id>
</award-group>
<award-group>
<funding-source>Fundamental Research Funds for the Central Universities</funding-source>
<award-id>JB150306</award-id>
<award-id>XJS15014</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>1. Introduction</title>
<p>A DNA motif is a conserved pattern occurring in the regulatory region of DNA sequences with small mutations [
<xref rid="B1" ref-type="bibr">1</xref>
]. All occurrences of the motif in the sequences are called motif instances or motif sites, which are usually the sequence fragments with specific biological functions such as transcription factor binding sites (TFBSs) [
<xref rid="B2" ref-type="bibr">2</xref>
]. TFBSs are important regulatory elements that control transcription initiation and transcription efficiency of the associated genes. Identifying motifs in a given set of DNA sequences is the basis for analysis of gene expression regulation [
<xref rid="B3" ref-type="bibr">3</xref>
] and the precursor to identifying disease-associated regulatory variations [
<xref rid="B4" ref-type="bibr">4</xref>
].</p>
<p>Though very important, motif discovery is a challenging computational task. Given a set of DNA sequences, (i) the motif and its instances are unknown; (ii) each of the input DNA sequences is long with hundreds of bases, while the motif is short, generally 5 to 25 bases [
<xref rid="B5" ref-type="bibr">5</xref>
]; (iii) a portion of the input sequences may not contain motif instances; (iv) the input sequences typically contain the disturbance of random overrepresented substrings. In 2003, Evans et al. proved that motif discovery is NP-complete [
<xref rid="B6" ref-type="bibr">6</xref>
]. In addition, with the development of biological experimental techniques, the data used for motif discovery have been changed from traditional promoter sequence data sets to high-throughput sequencing data sets [
<xref rid="B7" ref-type="bibr">7</xref>
]. A traditional data set typically contains only a few to dozens of sequences. A high-throughput sequencing data set is a set of peak regions containing TFBSs obtained through ChIP-seq experiments [
<xref rid="B8" ref-type="bibr">8</xref>
], read mapping [
<xref rid="B9" ref-type="bibr">9</xref>
], and peak calling [
<xref rid="B10" ref-type="bibr">10</xref>
]. It contains hundreds or more sequences, thus forming a large DNA sequence data set, and further increases the difficulty of rapid and accurate identification of motifs.</p>
<p>Currently, there are a lot of motif discovery algorithms to deal with small-scale data sets, such as Weeder [
<xref rid="B11" ref-type="bibr">11</xref>
], PairMotif [
<xref rid="B12" ref-type="bibr">12</xref>
], PairMotif+ [
<xref rid="B13" ref-type="bibr">13</xref>
], MEME [
<xref rid="B14" ref-type="bibr">14</xref>
], PMS8 [
<xref rid="B15" ref-type="bibr">15</xref>
], and qPMS9 [
<xref rid="B16" ref-type="bibr">16</xref>
]; for more algorithms, refer to [
<xref rid="B7" ref-type="bibr">7</xref>
,
<xref rid="B17" ref-type="bibr">17</xref>
]. Because of high time or space complexity, these algorithms cannot be used for motif discovery in high-throughput sequencing data sets directly.</p>
<p>This paper mainly focuses on motif discovery algorithms for high-throughput sequencing data sets. According to motif representation, the algorithms can be divided into two categories. The algorithms in the first category represent motifs as words. Some of these algorithms, such as F-motif [
<xref rid="B18" ref-type="bibr">18</xref>
] and weeder2 [
<xref rid="B19" ref-type="bibr">19</xref>
], use pattern-driven ideas. They exhaustively verify all possible strings of the motif length over the DNA alphabet and then output the strings that satisfy specified motif property. When verifying motifs, F-motif and weeder2 use the suffix tree and De Bruijn graph techniques, respectively. Some other algorithms, such as RSAT [
<xref rid="B20" ref-type="bibr">20</xref>
], CisFinder [
<xref rid="B21" ref-type="bibr">21</xref>
], and MCES [
<xref rid="B22" ref-type="bibr">22</xref>
], adopt word counting ideas; namely, they mine the substrings in input sequences with high occurrence frequency and then combine them into motifs. Besides a test set, these algorithms often require a control set to eliminate the disturbance of random overrepresented substrings.</p>
<p>The second category covers the discovery algorithms representing motifs as position weight matrixes (PWMs). A set of aligned sites of the same length in the input sequences can form a PWM. These algorithms often select some initial PWMs with certain means and then update each PWM iteratively until it reaches the maximum score. MEME-ChIP [
<xref rid="B23" ref-type="bibr">23</xref>
] is a well-known motif discovery algorithm for ChIP-seq data sets, which updates initial PWMs using the expectation maximization method. STEME [
<xref rid="B24" ref-type="bibr">24</xref>
], another discovery algorithm based on expectation maximization, uses suffix trees to improve the time performance of motif discovery when implementing expectation maximization. Currently, there is no discovery algorithm completely superior to others, and thus, in order to tackle false positives produced by individual discovery algorithms, ensemble algorithms [
<xref rid="B25" ref-type="bibr">25</xref>
] integrate multiple existing discovery algorithms to improve the quality of identified motifs.</p>
<p>In order to efficiently identify motifs in large DNA data sets, we propose a new algorithm, which identifies motifs by extracting and combining pairs of
<italic>l</italic>
-mers in the input with relatively small Hamming distance. Comparisons with the state-of-the-art motif discovery algorithms show that the proposed algorithm can find motifs successfully with the shortest running time. Also, the validity of the proposed algorithm has been verified on real data.</p>
</sec>
<sec id="sec2">
<title>2. Materials and Methods</title>
<sec id="sec2.1">
<title>2.1. Algorithm Overview</title>
<p>The notations frequently used in this paper are summarized in the Notations. When we say a pair of
<italic>l</italic>
-mers, we are referring to two
<italic>l</italic>
-mers that come from two distinct sequences.</p>
<p>Almost all de novo motif discovery algorithms make identification based on the fact that the motif instances of the same motif are similar to each other. In other words, the motif information contained in the input sequences is presented by the similarity among motif instances. In addition to the degree of similarity among motif instances, the motif information also depends on the number of pairs of motif instances contained in the input sequences, denoted by
<italic>N</italic>
<sub>mip</sub>
. It is calculated by
<disp-formula id="EEq1">
<label>(1)</label>
<mml:math id="M1">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">m</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mfenced separators="" open="(" close=")">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo></mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>In our previous work, PairMotif [
<xref rid="B12" ref-type="bibr">12</xref>
] and PairMotif+ [
<xref rid="B13" ref-type="bibr">13</xref>
], we mainly process promoter sequences, which correspond to a small
<italic>t</italic>
. The basic idea is to extract some pairs of
<italic>l</italic>
-mers in the input, making them contain at least one pair of motif instances, and then refine each pair of
<italic>l</italic>
-mers to get motifs. Because of the small value of
<italic>N</italic>
<sub>mip</sub>
, limited motif information can be obtained while retaining a large amount of disturbance information. Thus, in order to ensure good identification accuracy, exhaustive methods based on pattern-driven ideas are used for refinement, which has a poor time performance.</p>
<p>In the current work, we propose a new algorithm called PairMotifChIP, which is used for processing large DNA data sets. Our basic idea is still to extract pairs of
<italic>l</italic>
-mers in the input. Since the value of
<italic>N</italic>
<sub>mip</sub>
under large data sets is significantly greater than that under traditional promoter data sets, the advantages are as follows: (i) the extracted pairs of
<italic>l</italic>
-mers contain sufficient pairs of motif instances and (ii) it can be easier to filter out most of the random overrepresented pairs of
<italic>l</italic>
-mers; namely, we can distinguish most of the pairs of motif instances and random overrepresented pairs of
<italic>l</italic>
-mers by probabilistic analysis (see
<xref ref-type="sec" rid="sec3.1">Section 3.1</xref>
). Therefore, after extracting pairs of
<italic>l</italic>
-mers, we perform filtration to filter out most of the random overrepresented pairs of
<italic>l</italic>
-mers and then combine the remaining
<italic>l</italic>
-mers using clustering methods to obtain motifs while eliminating other random overrepresented
<italic>l</italic>
-mers.</p>
<p>The overall algorithm of PairMotifChIP is shown in
<xref ref-type="fig" rid="alg1">Algorithm 1</xref>
, containing three steps: extracting pairs of
<italic>l</italic>
-mers (lines (2)–(4)), filtering pairs of
<italic>l</italic>
-mers (lines (5)–(9)), and combining
<italic>l</italic>
-mers (lines (10)–(13)). Next, the technical details of the three steps are described in detail.</p>
</sec>
<sec id="sec2.2">
<title>2.2. Extracting Pairs of
<italic>l</italic>
-mers</title>
<p>In this step, we need to determine the value of a threshold
<italic>k</italic>
so that we extract all pairs of
<italic>l</italic>
-mers
<italic>x</italic>
and
<italic>x</italic>
′ in the input satisfying
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
,
<italic>x</italic>
′) ≤
<italic>k</italic>
and design an efficient algorithm to extract pairs of
<italic>l</italic>
-mers.</p>
<p>Let
<italic>p</italic>
<sub>
<italic>i</italic>
</sub>
denote the probability that the Hamming distance between two random
<italic>l</italic>
-mers is no more than
<italic> i</italic>
.
<disp-formula id="EEq2">
<label>(2)</label>
<mml:math id="M2">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo stretchy="false"></mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn mathvariant="normal">0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>l</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>k</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:mo>×</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn mathvariant="normal">3</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn mathvariant="normal">4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Let
<italic>E</italic>
<sub>
<italic>i</italic>
</sub>
denote the expectation of the number of pairs of
<italic>l</italic>
-mers
<italic>x</italic>
and
<italic>x</italic>
′ in two
<italic> n</italic>
-length DNA sequences satisfying
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
,
<italic>x</italic>
′) ≤
<italic>i</italic>
.
<disp-formula id="EEq3">
<label>(3)</label>
<mml:math id="M3">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>The threshold
<italic>k</italic>
is determined by (
<xref ref-type="disp-formula" rid="EEq4">4</xref>
), aiming at eliminating random overrepresented substrings in the pairs of
<italic>l</italic>
-mers extracted from two
<italic>n</italic>
-length sequences.
<disp-formula id="EEq4">
<label>(4)</label>
<mml:math id="M4">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">max</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="2.484pt"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mn mathvariant="normal">0</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo><</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mspace height="7.08pt" depth="2.484pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>In order to handle large data sets efficiently, a good time performance of extracting pairs of
<italic>l</italic>
-mers is necessary. For extracting pairs of
<italic>l</italic>
-mers
<italic>x</italic>
and
<italic>x</italic>
′ from two
<italic> n</italic>
-length sequences
<italic>s</italic>
and
<italic>s</italic>
′ satisfying
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
,
<italic>x</italic>
′) ≤
<italic>k</italic>
, the simplest way is to traverse each pair of
<italic>l</italic>
-mers
<italic>x</italic>
and
<italic>x</italic>
′ and calculate the Hamming distance
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
,
<italic>x</italic>
′) separately. The time complexity of this method is
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
<italic>l</italic>
). Yu et al. [
<xref rid="B26" ref-type="bibr">26</xref>
] proposed a method of
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
) time by filling an
<italic>n</italic>
×
<italic>n</italic>
matrix
<italic>M</italic>
. The element in row
<italic>i</italic>
 (1 ≤
<italic>i</italic>
<italic>n</italic>
) and column
<italic>j</italic>
 (1 ≤
<italic>j</italic>
<italic>n</italic>
) of
<italic>M</italic>
, denoted by
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
], stores min(
<italic>i</italic>
,
<italic>j</italic>
) −
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>s</italic>
[
<italic>i</italic>
− min⁡(
<italic>i</italic>
,
<italic>j</italic>
) + 1 ⋯
<italic>i</italic>
],
<italic>s</italic>
′[
<italic>j</italic>
− min⁡(
<italic>i</italic>
,
<italic>j</italic>
) + 1 ⋯
<italic>j</italic>
]), where min(
<italic>i</italic>
,
<italic>j</italic>
) is the smaller one of two integers. After calculating
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
], the Hamming distance between the two
<italic>l</italic>
-mers
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 ⋯
<italic>i</italic>
] and
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 ⋯
<italic>j</italic>
] can be obtained; namely,
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 ⋯
<italic>i</italic>
],
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 ⋯
<italic>j</italic>
]) =
<italic>l</italic>
− (
<italic>M</italic>
[
<italic>i</italic>
,
<italic>j</italic>
] −
<italic>M</italic>
[
<italic>i</italic>
<italic>l</italic>
,
<italic>j</italic>
<italic>l</italic>
]).</p>
<p>In this paper, we design a more efficient method. We only care about the pairs of
<italic>l</italic>
-mers with Hamming distance no more than
<italic>k</italic>
. That is, we do not need to calculate Hamming distance for all pairs of
<italic>l</italic>
-mers. If the Hamming distance between
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 ⋯
<italic>i</italic>
] and
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 ⋯
<italic>j</italic>
] is greater than
<italic>k</italic>
, it can be concluded that, for 0 ≤
<italic>h</italic>
<
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 ⋯
<italic>i</italic>
],
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 ⋯
<italic>j</italic>
]) −
<italic>k</italic>
, the possibly smallest Hamming distance between
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 +
<italic>h</italic>
<italic>i</italic>
] and
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 +
<italic>h</italic>
<italic>j</italic>
] is still greater than
<italic>k</italic>
. Thus, after calculating the Hamming distance between
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 ⋯
<italic>i</italic>
] and
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 ⋯
<italic>j</italic>
], we can go directly to verify the Hamming distance between
<italic>s</italic>
[
<italic>i</italic>
<italic>l</italic>
+ 1 +
<italic>H</italic>
<italic>i</italic>
] and
<italic>s</italic>
′[
<italic>j</italic>
<italic>l</italic>
+ 1 +
<italic>H</italic>
<italic>j</italic>
], where
<italic>H</italic>
is the skipped size calculated by
<disp-formula id="EEq5">
<label>(5)</label>
<mml:math id="M5">
<mml:mtable style="T∞8">
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mi>H</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">max</mml:mi>
</mml:mrow>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="|">
<mml:mrow>
<mml:mspace height="9.52998pt" depth="4.43001pt"></mml:mspace>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="9.12598pt" depth="2.98001pt"></mml:mspace>
<mml:mi>s</mml:mi>
<mml:mfenced separators="" open="[" close="]">
<mml:mrow>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi></mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced separators="" open="[" close="]">
<mml:mrow>
<mml:mspace height="7.08pt" depth="2.59pt"></mml:mspace>
<mml:mi>j</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>j</mml:mi>
<mml:mspace height="7.08pt" depth="2.59pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace height="9.12598pt" depth="2.98001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="10pt"></mml:mspace>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mspace height="9.52998pt" depth="4.43001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Based on this, we describe our method as follows.
<xref ref-type="fig" rid="fig1"> Figure 1</xref>
shows an example in which |
<italic>s</italic>
| = |
<italic>s</italic>
′| = 20,  
<italic>l</italic>
= 5,  
<italic>k</italic>
= 1. First, convert the two DNA sequences
<italic>s</italic>
and
<italic>s</italic>
′ into binary strings
<italic>B</italic>
and
<italic>B</italic>
′. Then, fixing
<italic>B</italic>
, move
<italic>B</italic>
′ from left to right gradually 2 bits each time and simultaneously do the exclusive OR for the overlapped substrings of
<italic>B</italic>
and
<italic>B</italic>
′ with length equal to or greater than 2
<italic>l</italic>
; xor(
<italic>i</italic>
,
<italic>j</italic>
,
<italic>i</italic>
′,
<italic>j</italic>
′) denotes the exclusive OR of the overlapped substrings of
<italic>B</italic>
and
<italic>B</italic>
′ corresponding to
<italic>s</italic>
[
<italic>i</italic>
<italic>j</italic>
] and
<italic>s</italic>
′[
<italic>i</italic>
′ ⋯
<italic>j</italic>
′], where
<italic>j</italic>
<italic>i</italic>
=
<italic>j</italic>
′ −
<italic>i</italic>
′. Finally, extract pairs of
<italic>l</italic>
-mers with Hamming distance no more than
<italic>k</italic>
by traversing the exclusive OR xor(
<italic>i</italic>
,
<italic>j</italic>
,
<italic>i</italic>
′,
<italic>j</italic>
′) for each overlapped part of
<italic>B</italic>
and
<italic>B</italic>
′. Specifically, for each
<italic>r</italic>
 (
<italic>l</italic>
<italic>r</italic>
<italic>j</italic>
), we look up a table [
<xref rid="B12" ref-type="bibr">12</xref>
] to obtain the Hamming distance between two
<italic>l</italic>
-mers
<italic>s</italic>
[
<italic>i</italic>
+
<italic>r</italic>
<italic>l</italic>
<italic>i</italic>
+
<italic>r</italic>
− 1] and
<italic>s</italic>
′[
<italic>i</italic>
′ +
<italic>r</italic>
<italic>l</italic>
<italic>i</italic>
′ +
<italic>r</italic>
− 1], which is equal to the number of 2 bits that is not 00 in xor(
<italic>i</italic>
+
<italic>r</italic>
<italic>l</italic>
,
<italic>i</italic>
+
<italic>r</italic>
− 1,
<italic>i</italic>
′ +
<italic>r</italic>
<italic>l</italic>
,
<italic>i</italic>
′ +
<italic>r</italic>
− 1). During the traversal, we use (
<xref ref-type="disp-formula" rid="EEq5">5</xref>
) to avoid the calculation for some pairs of
<italic>l</italic>
-mers with Hamming distance greater than
<italic>k</italic>
.</p>
</sec>
<sec id="sec2.3">
<title>2.3. Filtering Pairs of
<italic>l</italic>
-mers</title>
<p>The purpose of this step is to filter those random overrepresented pairs of
<italic>l</italic>
-mers extracted in the previous step. According to the property of conservation, if an
<italic>l</italic>
-mer
<italic>x</italic>
is a motif instance, there may be some other motif instances with Hamming distance no more than
<italic>k</italic>
from
<italic> x</italic>
, and thus there may be relatively more
<italic>l</italic>
-mers in the whole data set with Hamming distance no more than
<italic>k</italic>
from
<italic>x</italic>
. If an
<italic>l</italic>
-mer
<italic>x</italic>
is not a motif instance, even if it appears in a random overrepresented pair of
<italic>l</italic>
-mers, there are not many
<italic>l</italic>
-mers in the whole data set with Hamming distance no more than
<italic>k</italic>
from
<italic>x</italic>
.</p>
<p>For an arbitrary
<italic>l</italic>
-mer
<italic>x</italic>
, let occ
<sub>
<italic>r</italic>
</sub>
(
<italic>x</italic>
) denote the number of
<italic>l</italic>
-mers in the input sequences with Hamming distance no more than
<italic>k</italic>
from
<italic>x</italic>
in random case.
<disp-formula id="EEq6">
<label>(6)</label>
<mml:math id="M6">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">o</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>×</mml:mo>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>For an arbitrary motif instance
<italic>x</italic>
, let occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
) denote the number of motif instances in the input sequences with Hamming distance no more than
<italic>k</italic>
from
<italic>x</italic>
in random case.</p>
<p>We perform filtration according to (
<xref ref-type="disp-formula" rid="EEq7">7</xref>
). Let occ(
<italic>x</italic>
) denote the number of
<italic>l</italic>
-mers in the input sequences with Hamming distance no more than
<italic>k</italic>
from an
<italic>l</italic>
-mer
<italic>x</italic>
. In the process of extracting pairs of
<italic>l</italic>
-mers, we can easily record occ(
<italic>x</italic>
) for each
<italic>l</italic>
-mer
<italic>x</italic>
in the input sequences. For an extracted
<italic>l</italic>
-mer
<italic>x</italic>
, we filter it out if it does not satisfy
<disp-formula id="EEq7">
<label>(7)</label>
<mml:math id="M7">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">o</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo></mml:mo>
<mml:mi mathvariant="normal">o</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">o</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>In the following, we focus on how to calculate occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
). We calculate it by combining the methods [
<xref rid="B13" ref-type="bibr">13</xref>
,
<xref rid="B22" ref-type="bibr">22</xref>
] for evaluating the following probabilities. One is the probability that the Hamming distance between a motif
<italic>m</italic>
and a random motif instance
<italic>m</italic>
′ is
<italic>i</italic>
 (0 ≤
<italic>i</italic>
<italic>d</italic>
), denoted by Pr(
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
′) =
<italic>i</italic>
). The other is the probability that the Hamming distance between two random motif instances
<italic>m</italic>
<sub>1</sub>
and
<italic>m</italic>
<sub>2</sub>
is no more than
<italic>k</italic>
, denoted by
<italic>p</italic>
<sub>
<italic>k</italic>
</sub>
′.</p>
<p>Let
<italic>d</italic>
denote the maximum number of positions where a motif differs from its instance. Given the motif length
<italic> l</italic>
, we determine the value of
<italic>d</italic>
in terms of the challenging problem instance of planted motif search [
<xref rid="B27" ref-type="bibr">27</xref>
], which assumes that motif instances are implanted into sequences. When generating a random motif instance
<italic>m</italic>
′ from a motif
<italic>m</italic>
, we select
<italic>d</italic>
out of
<italic>l</italic>
positions randomly and then make the character at each of the
<italic>d</italic>
positions change with a probability of
<italic>g</italic>
, which is the conservation parameter reflecting the conservation degree of a motif. Based on this, Pr(
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
′) =
<italic>i</italic>
) is evaluated as follows:
<disp-formula id="EEq8">
<label>(8)</label>
<mml:math id="M8">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="9.52998pt" depth="4.43001pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="9.12598pt" depth="1.16998pt"></mml:mspace>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi></mml:mi>
</mml:mrow>
</mml:msup>
<mml:mspace height="9.12598pt" depth="1.16998pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
<mml:mspace height="9.52998pt" depth="4.43001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>d</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>i</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="6.35999pt" depth="2.59pt"></mml:mspace>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>g</mml:mi>
<mml:mspace height="6.35999pt" depth="2.59pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo></mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>For any two instances
<italic>m</italic>
<sub>1</sub>
and
<italic>m</italic>
<sub>2</sub>
of a motif
<italic>m</italic>
, the possible values of 〈
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
),
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
)〉 are {〈
<italic>a</italic>
,
<italic>b</italic>
〉 :  0 ≤
<italic>a</italic>
<italic>d</italic>
,  0 ≤
<italic>b</italic>
<italic>d</italic>
}. Let Pr(〈
<italic>a</italic>
,
<italic>b</italic>
〉) denote Pr(〈
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>1</sub>
),
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
,
<italic>m</italic>
<sub>2</sub>
)〉 = 〈
<italic>a</italic>
,
<italic>b</italic>
〉). We have
<disp-formula id="EEq9">
<label>(9)</label>
<mml:math id="M9">
<mml:mtable style="T2">
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="1.79501pt"></mml:mspace>
<mml:mfenced open="〈" close="〉" separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace height="7.08pt" depth="1.79501pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:malignmark></mml:malignmark>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>a</mml:mi>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mspace width="11.436553955078125pt"></mml:mspace>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>b</mml:mi>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>By the theorem of total probability,
<italic>p</italic>
<sub>
<italic>k</italic>
</sub>
′ is evaluated using (
<xref ref-type="disp-formula" rid="EEq10">10</xref>
) where the value of
<italic>P</italic>
(
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>m</italic>
<sub>1</sub>
,
<italic>m</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
∣〈
<italic>a</italic>
,
<italic>b</italic>
〉) is calculated in terms of the actual situation [
<xref rid="B13" ref-type="bibr">13</xref>
].
<disp-formula id="EEq10">
<label>(10)</label>
<mml:math id="M10">
<mml:mtable style="T3">
<mml:mtr>
<mml:mtd>
<mml:malignmark></mml:malignmark>
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi></mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mspace width="6.8067626953125pt"></mml:mspace>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo stretchy="false"></mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mn mathvariant="normal">0</mml:mn>
<mml:mo></mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo></mml:mo>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="1.79501pt"></mml:mspace>
<mml:mfenced open="〈" close="〉" separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace height="7.08pt" depth="1.79501pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace height="4.53pt" depth="2.4pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
<mml:mo></mml:mo>
</mml:mrow>
<mml:mfenced open="〈" close="〉" separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mspace height="7.08pt" depth="1.16998pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mspace height="8.07999pt" depth="2.98001pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Finally, we calculate occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
) as follows:
<disp-formula id="EEq11">
<label>(11)</label>
<mml:math id="M11">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">o</mml:mi>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace height="4.53pt" depth="0.12pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>×</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi></mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
</sec>
<sec id="sec2.4">
<title>2.4. Combining
<italic>l</italic>
-mers</title>
<p>After performing filtration, we combine the remaining
<italic>l</italic>
-mers by using the clustering method. On the one hand, we further eliminate random overrepresented
<italic>l</italic>
-mers, as the filtration carried out in the previous step cannot guarantee that all the random overrepresented
<italic>l</italic>
-mers are filtered out. On the other hand, the input data may contain more than one motif, and thus the clustering method is also used to distinguish the instances of different motifs and gather the instances of the same motif together.</p>
<p>The combining method is described in detail as follows:
<list list-type="simple">
<list-item>
<label>(i)</label>
<p>Merge the overlapped
<italic>l</italic>
-mers into one substring. For example, for the three
<italic>l</italic>
-mers
<italic>s</italic>
[
<italic>i</italic>
<italic>i</italic>
+
<italic>l</italic>
− 1],
<italic>s</italic>
[
<italic>i</italic>
+ 2 ⋯
<italic>i</italic>
+
<italic>l</italic>
+ 1], and
<italic>s</italic>
[
<italic>i</italic>
+ 5 ⋯
<italic>i</italic>
+
<italic>l</italic>
+ 4] in the sequence
<italic>s</italic>
, they are overlapped, and we merge them into a new substring
<italic>s</italic>
[
<italic>i</italic>
<italic>i</italic>
+
<italic>l</italic>
+ 4].</p>
</list-item>
<list-item>
<label>(ii)</label>
<p>Cluster the substrings. At first, we build an undirected graph
<italic>G</italic>
by taking each substring as a vertex. For any two vertices
<italic>v</italic>
<sub>1</sub>
and
<italic>v</italic>
<sub>2</sub>
, assume the corresponding substrings are str
<sub>1</sub>
and str
<sub>2</sub>
, respectively. If there exist an
<italic>l</italic>
-mer
<italic>x</italic>
<sub>1</sub>
in str
<sub>1</sub>
and an
<italic>l</italic>
-mer
<italic>x</italic>
<sub>2</sub>
in str
<sub>2</sub>
such that
<italic>d</italic>
<sub>
<italic>H</italic>
</sub>
(
<italic>x</italic>
<sub>1</sub>
,
<italic>x</italic>
<sub>2</sub>
) ≤
<italic>k</italic>
, then we set the weight of the edge connected by
<italic>v</italic>
<sub>1</sub>
and
<italic>v</italic>
<sub>2</sub>
to 1, otherwise 0. Secondly, we cluster the vertices in the graph
<italic>G</italic>
using the MCL clustering algorithm [
<xref rid="B28" ref-type="bibr">28</xref>
]. Finally, we merge the obtained clusters using the MCL clustering algorithm again to further eliminate redundancy.</p>
</list-item>
<list-item>
<label>(iii)</label>
<p>For each obtained cluster, align the substrings in it and then fetch the fragment with high information content as an identified motif.</p>
</list-item>
</list>
</p>
<p>In the combining method, we control the number of elements to be clustered in order to ensure a good time performance. First, merging overlapped
<italic>l</italic>
-mers into one substring can help to reduce the elements to be clustered. Second, if the number of substrings is still large, we divide them into multiple groups with each group containing at most 1000 substrings. Then, we cluster substrings in each group separately and finally merge the obtained clusters.</p>
</sec>
</sec>
<sec id="sec3">
<title>3. Results and Discussion</title>
<sec id="sec3.1">
<title>3.1. Probabilistic Analysis of Extracting Pairs of
<italic>l</italic>
-mers</title>
<p>We use probabilistic analysis to demonstrate the feasibility of motif discovery by extracting pairs of
<italic>l</italic>
-mers with relatively small Hamming distance for processing large DNA data sets. That is, given
<italic>t</italic>
DNA sequences, we verify whether the extracted pairs of
<italic>l</italic>
-mers with Hamming distance no more than
<italic>k</italic>
contain sufficient pairs of motif instances and whether the pairs of motif instances and the random overrepresented pairs of
<italic>l</italic>
-mers can be distinguished.</p>
<p>
<xref ref-type="table" rid="tab1">Table 1</xref>
shows a set of data for probabilistic analysis. In generating these data, we set the number of sequences
<italic>t</italic>
, the sequence length
<italic>n</italic>
, the probability
<italic>q</italic>
that each input sequence contains a motif instance, and the motif conservation parameter
<italic>g</italic>
to 1000, 200, 0.8, and 0.8, respectively. For a fixed
<italic>l</italic>
,
<italic>k</italic>
is obtained by (
<xref ref-type="disp-formula" rid="EEq4">4</xref>
), which is the maximum making
<italic>E</italic>
<sub>
<italic>k</italic>
</sub>
smaller than 1. Let
<italic>N</italic>
<sub>1</sub>
and
<italic>N</italic>
<sub>2</sub>
denote the number of pairs of
<italic>l</italic>
-mers and that of pairs of motif instances with Hamming distance no more than
<italic>k</italic>
contained in
<italic>t</italic>
  
<italic>n</italic>
-length sequences at random, respectively. The notations occ
<sub>
<italic>r</italic>
</sub>
(
<italic>x</italic>
) and occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
) are explained in the Notations.
<disp-formula id="EEq12">
<label>(12)</label>
<mml:math id="M12">
<mml:mtable style="T17">
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:malignmark></mml:malignmark>
<mml:mo>=</mml:mo>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>t</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
<mml:mspace height="7.08pt" depth="0.23pt"></mml:mspace>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:malignmark></mml:malignmark>
<mml:mo>=</mml:mo>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>q</mml:mi>
<mml:mi>t</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi></mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>From the values of
<italic>N</italic>
<sub>1</sub>
and
<italic>N</italic>
<sub>2</sub>
, it can be found that the extracted pairs of
<italic>l</italic>
-mers with Hamming distance no more than
<italic>k</italic>
contain sufficient pairs of motif instances and meanwhile many random overrepresented pairs of
<italic>l</italic>
-mers. For a given
<italic>l</italic>
-mer
<italic>x</italic>
, the
<italic>l</italic>
-mer in the input with Hamming distance no more than
<italic>k</italic>
from
<italic>x</italic>
is called a
<italic> k</italic>
-neighbor of
<italic>x</italic>
. From the values of occ
<sub>
<italic>r</italic>
</sub>
(
<italic>x</italic>
) and occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
), the number of
<italic>k</italic>
-neighbors of a motif instance is significantly larger than that of random overrepresented
<italic>l</italic>
-mers. Thus, we can distinguish the pairs of motif instances and the random overrepresented pairs of
<italic>l</italic>
-mers, so as to filter out most random overrepresented pairs of
<italic>l</italic>
-mers extracted in step 1. It should be noted that, for
<xref ref-type="table" rid="tab1">Table 1</xref>
, we set the motif conservation parameter to 0.8, which is in low conservation case; the value of occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
) increases with the increase of the conservation degree, which makes it easier to distinguish the pairs of motif instances and the random overrepresented pairs of
<italic>l</italic>
-mers. In implementing PairMotifChIP, we fix the value of
<italic>l</italic>
to 15 because it corresponds to a large value of occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
).</p>
</sec>
<sec id="sec3.2">
<title>3.2. Data</title>
<p>We use both simulated and real data to make experiments. The simulated data are helpful for the comparison of different motif discovery algorithms, since the motifs and their locations are known exactly. The real data are mainly used to test whether the proposed algorithm can find motifs under the realistic biological situation.</p>
<p>We generate simulated DNA data according to [
<xref rid="B29" ref-type="bibr">29</xref>
]: generate
<italic>t</italic>
  
<italic>n</italic>
-length sequences and an
<italic> l</italic>
-length motif
<italic>m</italic>
randomly and then implant a random instance
<italic>m</italic>
′ of
<italic>m</italic>
to each sequence with a probability
<italic>q</italic>
. Each instance
<italic>m</italic>
′ differs from
<italic>m</italic>
in at most
<italic>d</italic>
positions; as shown in column 2 of
<xref ref-type="table" rid="tab1">Table 1</xref>
, the value of
<italic>d</italic>
under a specific
<italic>l</italic>
is determined in terms of the challenging problem instance of planted motif search [
<xref rid="B27" ref-type="bibr">27</xref>
]. When generating a random motif instance
<italic>m</italic>
′ from
<italic>m</italic>
, the specific Hamming distance
<italic>i</italic>
 (0 ≤
<italic>i</italic>
<italic>d</italic>
) between
<italic>m</italic>
′ and
<italic>m</italic>
is determined by (
<xref ref-type="disp-formula" rid="EEq8">8</xref>
), where
<italic>g</italic>
is the motif conservation parameter.</p>
<p>Based on this generation method, two groups of simulated data sets are designed, and they can be downloaded at
<ext-link ext-link-type="uri" xlink:href="https://sites.google.com/site/feqond/pairmotifchip">https://sites.google.com/site/feqond/pairmotifchip</ext-link>
. For both groups of data sets,
<italic>n</italic>
is fixed to 200;
<italic>q</italic>
is fixed to 0.8;
<italic> l</italic>
is taken as 9, 15, and 21, representing short, medium, and long motif length, respectively. The other settings for the first group of data sets are as follows:
<italic>t</italic>
is fixed to 600, which is the largest sequence number that MEME-ChIP supports to process;
<italic>g</italic>
is taken as 0.2, 0.5, and 0.8, representing high, medium, and low conservation, respectively. For the second group of simulated data sets, we vary
<italic>t</italic>
from 500 to 3000 to obtain different data scales and fix
<italic>g</italic>
as 0.5.</p>
<p>The used real data are mouse embryonic stem cells (mESC) ChIP-seq data [
<xref rid="B30" ref-type="bibr">30</xref>
]. Each mESC data set, which corresponds to a specific transcription factor, is a set of 200-length sequences with each sequence being a peak region bound by the specific transcription factor. These data sets contain thousands to tens of thousands of sequences, and we used the first 600 sequences to make motif discovery for each data set.</p>
</sec>
<sec id="sec3.3">
<title>3.3. Evaluation</title>
<p>In the experiments, we compare the running time and identification accuracy of different motif discovery algorithms. For the site-level identification accuracy, we evaluate it by using the site-level performance coefficient sPC [
<xref rid="B29" ref-type="bibr">29</xref>
]. When we say a motif site
<italic>m</italic>
′ is in a set of motif sites
<italic>M</italic>
, there exists a motif site
<italic>m</italic>
′′ in
<italic>M</italic>
such that
<italic>m</italic>
′ overlaps
<italic>m</italic>
′′. Let
<italic>K</italic>
and
<italic>P</italic>
represent the published motif sites and the predicted motif sites, respectively. Then, site-level true positive (sTP), false negative (sFN), and false positive (sFP) are the number of motif sites in both
<italic>K</italic>
and
<italic>P</italic>
, in
<italic>K</italic>
but not in
<italic>P</italic>
, and not in
<italic>K</italic>
but in
<italic>P</italic>
, respectively. Based on this, sPC is calculated as follows:
<disp-formula id="EEq14">
<label>(13)</label>
<mml:math id="M13">
<mml:mtable style="T1">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>For the nucleotide-level identification accuracy, we evaluate it by using the nucleotide-level correlation coefficient (nCC) [
<xref rid="B31" ref-type="bibr">31</xref>
], an integrated assessment of nucleotide-level sensitivity (nSn) and specificity (nSp). The value of nCC ranges from −1 to +1; a high nCC indicates that the predicted motif is more accurate. Let
<italic>K</italic>
and
<italic>P</italic>
represent the nucleotide positions covered by the published motif sites and the predicated motif sites, respectively. Then, nucleotide-level true positive (nTP), false negative (nFN), false positive (nFP), and true negative (nTN) are the number of nucleotide positions in both
<italic>K</italic>
and
<italic>P</italic>
, in
<italic>K</italic>
but not in
<italic>P</italic>
, not in
<italic>K</italic>
but in
<italic>P</italic>
, and in neither
<italic>K</italic>
nor
<italic>P</italic>
, respectively. Based on this, nCC, nSn, and nSp are calculated as follows:
<disp-formula id="EEq15">
<label>(14)</label>
<mml:math id="M14">
<mml:mtable style="T19">
<mml:mtr>
<mml:mtd>
<mml:malignmark></mml:malignmark>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mi mathvariant="normal">C</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mspace width="0pt"></mml:mspace>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mfenced separators="|">
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:malignmark></mml:malignmark>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
</sec>
<sec id="sec3.4">
<title>3.4. Results on Simulated Data</title>
<p>We select four compared algorithms: MEME-ChIP [
<xref rid="B23" ref-type="bibr">23</xref>
], F-motif [
<xref rid="B18" ref-type="bibr">18</xref>
], PairMotif+ [
<xref rid="B13" ref-type="bibr">13</xref>
], and qPMS9 [
<xref rid="B16" ref-type="bibr">16</xref>
]. MEME-ChIP is a widely used motif discovery algorithm for ChIP-seq data based on PWM. F-motif is a motif discovery algorithm for ChIP-seq data based on word. PairMotif+ is a motif discovery algorithm designed in our previous work. qPMS9 is a recently proposed motif discovery algorithm; it is the best one in exact motif discovery algorithms and can identify motif efficiently in traditional promoter sequences.</p>
<p>All the algorithms are implemented in C or C++. Except for MEME-ChIP, whose results are produced by its web server (
<ext-link ext-link-type="uri" xlink:href="http://meme-suite.org/tools/meme-chip">http://meme-suite.org/tools/meme-chip</ext-link>
), all the algorithms run on the same machine with a 2.67 GHz CPU and a 4 G memory. Each result is the average obtained on three randomly generated data sets with the same settings. Both PairMotifChIP and MEME-ChIP use the default parameters to identify motifs. In executing F-motif, the minimum motif length is set to 8, and the value of
<italic>d</italic>
is set to 2, 5, and 8 when the length of the identified motif is 9, 15, and 21, respectively. Both PairMotif+ and qPMS9 need specified
<italic>l</italic>
and
<italic>d</italic>
of the identified motif. When calculating identification accuracy, the predicated motif sites are needed. For MEME-ChIP, the predicated motif sites are returned by its web server; for other algorithms, the sites are obtained by searching the substring in each input sequence having the smallest Hamming distance from the predicted motif.</p>
<p>For the first group of simulated data sets, the running time, the site-level identification accuracy, and the nucleotide-level identification accuracy are given in Tables
<xref ref-type="table" rid="tab2">2</xref>
,
<xref ref-type="table" rid="tab3">3</xref>
, and
<xref ref-type="table" rid="tab4">4</xref>
, respectively. It can be seen that (i) all these algorithms show good identification accuracy, since large data sets contain sufficient motif information; (ii) PairMotifChIP has the best time performance among these algorithms; namely, it is able to deal with the DNA data set of 600 sequences within one minute; (iii) MEME-ChIP has the second best time performance, and it solves the problem within half an hour; (iv) for other compared algorithms, their running time becomes impractical with the increase of
<italic>l</italic>
because of exhaustively verifying
<italic> l</italic>
-length candidate motifs.</p>
<p>For the second group of simulated data sets, whose data scales range from 500 to 3000 sequences, the comparison of different algorithms is shown in
<xref ref-type="table" rid="tab5">Table 5</xref>
. Here, MEME-ChIP and qPMS9 are not taken as the compared algorithms, as MEME-ChIP allows the processed data set containing at most 600 sequences and qPMS9 has a poor time performance in processing large data sets. It can be seen that (i) the three motif discovery algorithms still have a good identification accuracy; (ii) PairMotifChIP performs slower than F-Motif when
<italic>l</italic>
is 9, while it is significantly faster than the other two algorithms when
<italic>l</italic>
is 15 and 21; (iii) the running time of PairMotifChIP increases with the increase of data scale, but it does not depend on the motif length.</p>
<p>Finally, it is necessary to test the method for extracting pairs of
<italic>l</italic>
-mers because it plays a key role in the time performance of the whole PairMotifChIP algorithm.
<xref ref-type="table" rid="tab6"> Table 6</xref>
shows the comparison of the running time of the method in this paper and the method in [
<xref rid="B26" ref-type="bibr">26</xref>
]. When extracting pairs of
<italic>l</italic>
-mers with Hamming distance no more than
<italic>k</italic>
in two given
<italic>n</italic>
-length sequences, although the worst time complexity of the method in this paper is still
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
), it has a better time performance in practice. Specifically, the method in this paper is 10 times faster than that in [
<xref rid="B26" ref-type="bibr">26</xref>
], which makes it possible to process more DNA sequences.</p>
</sec>
<sec id="sec3.5">
<title>3.5. Results on Real Data</title>
<p>We test PairMotifChIP's validity of motif discovery on the real ChIP-seq data (i.e., the mESC data). To better show the results, we take our previous algorithm PairMotif+ for comparison. For PairMotifChIP, we use default parameters without any prior information; for PairMotif+, we use the same setting
<italic>l</italic>
= 13,
<italic> d</italic>
= 4, and
<italic>k</italic>
= 2 for each data set.</p>
<p>
<xref ref-type="table" rid="tab7">Table 7</xref>
shows the results on these real data. For each data set, we list the name of the specific transcription factor, the published motif, and the running time and predicted motif of the two algorithms. The motifs are shown in the form of sequence logos [
<xref rid="B32" ref-type="bibr">32</xref>
]. PairMotifChIP runs much faster than PairMotif+ on these data. Moreover, PairMotifChIP can successfully find a motif overlapping the published motif for each data set, while PairMotif+ fails to make prediction on the n-Myc, Smad1, STAT3, and Tcfcp2I1 data sets.</p>
</sec>
</sec>
<sec id="sec4">
<title>4. Conclusions</title>
<p>In order to identify motifs in large DNA data sets, we propose a new algorithm, PairMotifChIP, which is a ChIP-tailored version of PairMotif/PairMotif+. The main difference between PairMotifChIP and PairMotif/PairMotif+ is that (i) PairMotifChIP designs a more efficient method for extracting pairs of
<italic>l</italic>
-mers and (ii) unlike PairMotif/PairMotif+, which obtains motifs by exhaustively verifying candidate motifs generated from extracted pairs of
<italic>l</italic>
-mers, PairMotifChIP obtains motifs by combining extracted
<italic>l</italic>
-mers based on clustering methods. The improvements of PairMotifChIP over PairMotif/PairMotif+ are that (i) PairMotifChIP runs much faster when identifying motifs in large data sets and (ii) PairMotifChIP can make motif discovery without any prior information (e.g., the motif length). The executable program of PairMotifChIP is available at
<ext-link ext-link-type="uri" xlink:href="https://sites.google.com/site/feqond/pairmotifchip">https://sites.google.com/site/feqond/pairmotifchip</ext-link>
.</p>
<p>It should be noted that, limited by the idea of combining overrepresented substrings, PairMotifChIP may not work well on the traditional promoter sequence data set containing dozens of sequences because of the lack of sufficient motif information.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>This work was supported in part by the National Natural Science Foundation of China under Grants 61502366 and 61373044, the China Postdoctoral Science Foundation under Grant 2015M582621, the National Key Research and Development Program of China under Grant 2016YFC0102000, and the Fundamental Research Funds for the Central Universities under Grants JB150306 and XJS15014.</p>
</ack>
<glossary>
<title>Notations</title>
<def-list>
<def-item>
<term>
<italic>l</italic>
:</term>
<def>
<p>The motif length</p>
</def>
</def-item>
<def-item>
<term>
<italic>S</italic>
= {
<italic>s</italic>
<sub>1</sub>
,
<italic>s</italic>
<sub>2</sub>
, …,
<italic>s</italic>
<sub>
<italic>t</italic>
</sub>
}:</term>
<def>
<p>The input DNA data set; each input sequence
<italic>s</italic>
<sub>
<italic>i</italic>
</sub>
is an
<italic>n</italic>
-length string over the alphabet {A, C, G, T}</p>
</def>
</def-item>
<def-item>
<term>
<italic>t</italic>
:</term>
<def>
<p>The number of input sequences,
<italic>t</italic>
= |
<italic>S</italic>
|</p>
</def>
</def-item>
<def-item>
<term>
<italic>n</italic>
:</term>
<def>
<p>The length of each input sequence</p>
</def>
</def-item>
<def-item>
<term>
<italic>q</italic>
:</term>
<def>
<p>The probability that each input sequence contains a motif instance</p>
</def>
</def-item>
<def-item>
<term>
<italic>d</italic>
:</term>
<def>
<p>The maximum number of positions where a motif differs from its instance</p>
</def>
</def-item>
<def-item>
<term>
<italic>g</italic>
:</term>
<def>
<p>The motif conservation parameter</p>
</def>
</def-item>
<def-item>
<term>
<italic>l</italic>
-mer:</term>
<def>
<p>An
<italic>l</italic>
-length string over {A, C, G, T}</p>
</def>
</def-item>
<def-item>
<term>
<italic>k</italic>
:</term>
<def>
<p>The threshold of extracting pairs of
<italic>l</italic>
-mers, 0 ≤
<italic>k</italic>
<
<italic>l</italic>
</p>
</def>
</def-item>
<def-item>
<term>occ(
<italic>x</italic>
):</term>
<def>
<p>The number of
<italic>l</italic>
-mers in the input sequences with Hamming distance no more than
<italic>k</italic>
from an
<italic>l</italic>
-mer
<italic>x</italic>
</p>
</def>
</def-item>
<def-item>
<term>occ
<sub>
<italic>r</italic>
</sub>
(
<italic>x</italic>
):</term>
<def>
<p>The number of
<italic>l</italic>
-mers in the input sequences with Hamming distance no more than
<italic>k</italic>
from an arbitrary
<italic>l</italic>
-mer
<italic>x</italic>
in random case; it is calculated by (
<xref ref-type="disp-formula" rid="EEq6">6</xref>
)</p>
</def>
</def-item>
<def-item>
<term>occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
):</term>
<def>
<p>The number of motif instances in the input sequences with Hamming distance no more than
<italic>k</italic>
from an arbitrary motif instance
<italic>x</italic>
in random case; it is calculated by (
<xref ref-type="disp-formula" rid="EEq11">11</xref>
)</p>
</def>
</def-item>
<def-item>
<term>
<italic>s</italic>
[
<italic>i</italic>
<italic>j</italic>
]:</term>
<def>
<p>A substring of the string
<italic>s</italic>
starting from the
<italic>i</italic>
th position to the
<italic>j</italic>
th position.</p>
</def>
</def-item>
</def-list>
</glossary>
<sec>
<title>Competing Interests</title>
<p>The authors declare that there are no competing interests regarding the publication of this paper.</p>
</sec>
<ref-list>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>D'Haeseleer</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>What are DNA sequence motifs?</article-title>
<source>
<italic>Nature Biotechnology</italic>
</source>
<year>2006</year>
<volume>24</volume>
<issue>4</issue>
<fpage>423</fpage>
<lpage>425</lpage>
<pub-id pub-id-type="doi">10.1038/nbt0406-423</pub-id>
<pub-id pub-id-type="other">2-s2.0-33645748095</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matys</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Fricke</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Geffers</surname>
<given-names>R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>TRANSFAC®: transcriptional regulation, from patterns to profiles</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2003</year>
<volume>31</volume>
<issue>1</issue>
<fpage>374</fpage>
<lpage>378</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkg108</pub-id>
<pub-id pub-id-type="other">2-s2.0-0037208166</pub-id>
<pub-id pub-id-type="pmid">12520026</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>GuhaThakurta</surname>
<given-names>D.</given-names>
</name>
</person-group>
<article-title>Computational identification of transcriptional regulatory elements in DNA sequence</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2006</year>
<volume>34</volume>
<issue>12</issue>
<fpage>3585</fpage>
<lpage>3598</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl372</pub-id>
<pub-id pub-id-type="other">2-s2.0-33746835021</pub-id>
<pub-id pub-id-type="pmid">16855295</pub-id>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mathelier</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>W. W.</given-names>
</name>
</person-group>
<article-title>Identification of altered cis-regulatory elements in human disease</article-title>
<source>
<italic>Trends in Genetics</italic>
</source>
<year>2015</year>
<volume>31</volume>
<issue>2</issue>
<fpage>67</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="doi">10.1016/j.tig.2014.12.003</pub-id>
<pub-id pub-id-type="other">2-s2.0-84921859918</pub-id>
<pub-id pub-id-type="pmid">25637093</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>C.-W.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>W.-S.</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>S.-Y.</given-names>
</name>
</person-group>
<article-title>An improved heuristic algorithm for finding motif signals in DNA sequences</article-title>
<source>
<italic>IEEE/ACM Transactions on Computational Biology and Bioinformatics</italic>
</source>
<year>2011</year>
<volume>8</volume>
<issue>4</issue>
<fpage>959</fpage>
<lpage>975</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2010.92</pub-id>
<pub-id pub-id-type="other">2-s2.0-79957634274</pub-id>
<pub-id pub-id-type="pmid">20855921</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Evans</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>A. D.</given-names>
</name>
<name>
<surname>Wareham</surname>
<given-names>H. T.</given-names>
</name>
</person-group>
<article-title>On the complexity of finding common approximate substrings</article-title>
<source>
<italic>Theoretical Computer Science</italic>
</source>
<year>2003</year>
<volume>306</volume>
<issue>1–3</issue>
<fpage>407</fpage>
<lpage>430</lpage>
<pub-id pub-id-type="doi">10.1016/s0304-3975(03)00320-7</pub-id>
<pub-id pub-id-type="other">MR2000185</pub-id>
<pub-id pub-id-type="other">2-s2.0-0042890585</pub-id>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zambelli</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Pesole</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Pavesi</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Motif discovery and transcription factor binding sites before and after the next-generation sequencing era</article-title>
<source>
<italic>Briefings in Bioinformatics</italic>
</source>
<year>2013</year>
<volume>14</volume>
<issue>2</issue>
<fpage>225</fpage>
<lpage>237</lpage>
<pub-id pub-id-type="publisher-id">bbs016</pub-id>
<pub-id pub-id-type="doi">10.1093/bib/bbs016</pub-id>
<pub-id pub-id-type="other">2-s2.0-84875590756</pub-id>
<pub-id pub-id-type="pmid">22517426</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mardis</surname>
<given-names>E. R.</given-names>
</name>
</person-group>
<article-title>ChIP-seq: welcome to the new frontier</article-title>
<source>
<italic>Nature Methods</italic>
</source>
<year>2007</year>
<volume>4</volume>
<issue>8</issue>
<fpage>613</fpage>
<lpage>614</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth0807-613</pub-id>
<pub-id pub-id-type="other">2-s2.0-34547630151</pub-id>
<pub-id pub-id-type="pmid">17664943</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Homer</surname>
<given-names>N.</given-names>
</name>
</person-group>
<article-title>A survey of sequence alignment algorithms for next-generation sequencing</article-title>
<source>
<italic>Briefings in Bioinformatics</italic>
</source>
<year>2010</year>
<volume>11</volume>
<issue>5</issue>
<fpage>473</fpage>
<lpage>483</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbq015</pub-id>
<pub-id pub-id-type="other">2-s2.0-77957272611</pub-id>
<pub-id pub-id-type="pmid">20460430</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Selby</surname>
<given-names>H.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A short survey of computational analysis methods in analysing ChIP-seq data</article-title>
<source>
<italic>Human Genomics</italic>
</source>
<year>2011</year>
<volume>5</volume>
<issue>2</issue>
<fpage>117</fpage>
<lpage>123</lpage>
<pub-id pub-id-type="doi">10.1186/1479-7364-5-2-117</pub-id>
<pub-id pub-id-type="other">2-s2.0-79955838994</pub-id>
<pub-id pub-id-type="pmid">21296745</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pavesi</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Mereghetti</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mauri</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Pesole</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2004</year>
<volume>32</volume>
<fpage>W199</fpage>
<lpage>W203</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh465</pub-id>
<pub-id pub-id-type="other">2-s2.0-3242884167</pub-id>
<pub-id pub-id-type="pmid">15215380</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>PairMotif: a new pattern-driven algorithm for planted (
<italic>l, d</italic>
) DNA motif search</article-title>
<source>
<italic>PLoS ONE</italic>
</source>
<year>2012</year>
<volume>7</volume>
<issue>10</issue>
<pub-id pub-id-type="publisher-id">e48442</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0048442</pub-id>
<pub-id pub-id-type="other">2-s2.0-84868324237</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>PairMotif+: a fast and effective algorithm for de novo motif discovery in DNA sequences</article-title>
<source>
<italic>International Journal of Biological Sciences</italic>
</source>
<year>2013</year>
<volume>9</volume>
<issue>4</issue>
<fpage>412</fpage>
<lpage>424</lpage>
<pub-id pub-id-type="doi">10.7150/ijbs.5786</pub-id>
<pub-id pub-id-type="other">2-s2.0-84876989317</pub-id>
<pub-id pub-id-type="pmid">23678291</pub-id>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bailey</surname>
<given-names>T. L.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Misleh</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W. W.</given-names>
</name>
</person-group>
<article-title>MEME: discovering and analyzing DNA and protein sequence motifs</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2006</year>
<volume>34</volume>
<fpage>W369</fpage>
<lpage>W373</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkl198</pub-id>
<pub-id pub-id-type="other">2-s2.0-33747823564</pub-id>
<pub-id pub-id-type="pmid">16845028</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nicolae</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S.</given-names>
</name>
</person-group>
<article-title>Efficient sequential and parallel algorithms for planted motif search</article-title>
<source>
<italic>BMC Bioinformatics</italic>
</source>
<year>2014</year>
<volume>15</volume>
<issue>1, article 34</issue>
<pub-id pub-id-type="doi">10.1186/1471-2105-15-34</pub-id>
<pub-id pub-id-type="other">2-s2.0-84893200025</pub-id>
</element-citation>
</ref>
<ref id="B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nicolae</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Rajasekaran</surname>
<given-names>S.</given-names>
</name>
</person-group>
<article-title>qPMS9: an efficient algorithm for quorum planted motif search</article-title>
<source>
<italic>Scientific Reports</italic>
</source>
<year>2015</year>
<volume>5, article 7813</volume>
</element-citation>
</ref>
<ref id="B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Das</surname>
<given-names>M. K.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>H.-K.</given-names>
</name>
</person-group>
<article-title>A survey of DNA motif finding algorithms</article-title>
<source>
<italic>BMC Bioinformatics</italic>
</source>
<year>2007</year>
<volume>8</volume>
<issue>supplement 7, article S21</issue>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-s7-s21</pub-id>
<pub-id pub-id-type="other">2-s2.0-38549144819</pub-id>
</element-citation>
</ref>
<ref id="B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Carson</surname>
<given-names>M. B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>H.</given-names>
</name>
</person-group>
<article-title>A new exhaustive method and strategy for finding motifs in ChIP-enriched regions</article-title>
<source>
<italic>PLoS ONE</italic>
</source>
<year>2014</year>
<volume>9</volume>
<issue>1</issue>
<pub-id pub-id-type="publisher-id">e86044</pub-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0086044</pub-id>
<pub-id pub-id-type="other">2-s2.0-84899858110</pub-id>
</element-citation>
</ref>
<ref id="B19">
<label>19</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zambelli</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Pavesi</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>A faster algorithm for motif finding in sequences from ChIP-Seq data</article-title>
<conf-name>Proceedings of the 8th International Meeting of Computational Intelligence Methods for Bioinformatics and Biostatistics</conf-name>
<conf-date>June 2011</conf-date>
<conf-loc>Gargnano del Garda, Italy</conf-loc>
<fpage>201</fpage>
<lpage>212</lpage>
</element-citation>
</ref>
<ref id="B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thomas-Chollier</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Herrmann</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Defrance</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Sand</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Thieffry</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>van Helden</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2012</year>
<volume>40</volume>
<issue>4, article e31</issue>
<pub-id pub-id-type="doi">10.1093/nar/gkr1104</pub-id>
<pub-id pub-id-type="other">2-s2.0-84857859359</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sharov</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>M. S. H.</given-names>
</name>
</person-group>
<article-title>Exhaustive search for over-represented DNA sequence motifs with CisFinder</article-title>
<source>
<italic>DNA Research</italic>
</source>
<year>2009</year>
<volume>16</volume>
<issue>5</issue>
<fpage>261</fpage>
<lpage>273</lpage>
<pub-id pub-id-type="doi">10.1093/dnares/dsp014</pub-id>
<pub-id pub-id-type="other">2-s2.0-74949138014</pub-id>
<pub-id pub-id-type="pmid">19740934</pub-id>
</element-citation>
</ref>
<ref id="B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Huan</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>An efficient algorithm for discovering motifs in large DNA data sets</article-title>
<source>
<italic>IEEE Transactions on Nanobioscience</italic>
</source>
<year>2015</year>
<volume>14</volume>
<issue>5</issue>
<fpage>535</fpage>
<lpage>544</lpage>
<pub-id pub-id-type="doi">10.1109/tnb.2015.2421340</pub-id>
<pub-id pub-id-type="other">2-s2.0-84939181502</pub-id>
<pub-id pub-id-type="pmid">25872217</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Machanick</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>T. L.</given-names>
</name>
</person-group>
<article-title>MEME-ChIP: motif analysis of large DNA datasets</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2011</year>
<volume>27</volume>
<issue>12</issue>
<fpage>1696</fpage>
<lpage>1697</lpage>
<pub-id pub-id-type="publisher-id">btr189</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr189</pub-id>
<pub-id pub-id-type="other">2-s2.0-79958117256</pub-id>
<pub-id pub-id-type="pmid">21486936</pub-id>
</element-citation>
</ref>
<ref id="B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reid</surname>
<given-names>J. E.</given-names>
</name>
<name>
<surname>Wernisch</surname>
<given-names>L.</given-names>
</name>
</person-group>
<article-title>STEME: efficient em to find motifs in large data sets</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2011</year>
<volume>39</volume>
<issue>18, article e126</issue>
<pub-id pub-id-type="doi">10.1093/nar/gkr574</pub-id>
<pub-id pub-id-type="other">2-s2.0-80054070341</pub-id>
</element-citation>
</ref>
<ref id="B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lihu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Holban</surname>
<given-names>Ş.</given-names>
</name>
</person-group>
<article-title>A review of ensemble methods for de novo motif discovery in ChIP-Seq data</article-title>
<source>
<italic>Briefings in Bioinformatics</italic>
</source>
<year>2015</year>
<volume>16</volume>
<issue>6</issue>
<fpage>964</fpage>
<lpage>973</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbv022</pub-id>
<pub-id pub-id-type="other">2-s2.0-84954318496</pub-id>
<pub-id pub-id-type="pmid">25888697</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Vitter</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Huan</surname>
<given-names>J.</given-names>
</name>
</person-group>
<article-title>Reference sequence selection for motif searches</article-title>
<conf-name>Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '15)</conf-name>
<conf-date>November 2015</conf-date>
<conf-loc>Washington, DC, USA</conf-loc>
<publisher-name>IEEE</publisher-name>
<fpage>569</fpage>
<lpage>574</lpage>
<pub-id pub-id-type="doi">10.1109/bibm.2015.7359745</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buhler</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tompa</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Finding motifs using random projections</article-title>
<source>
<italic>Journal of Computational Biology</italic>
</source>
<year>2002</year>
<volume>9</volume>
<issue>2</issue>
<fpage>225</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1089/10665270252935430</pub-id>
<pub-id pub-id-type="other">2-s2.0-0036110853</pub-id>
<pub-id pub-id-type="pmid">12015879</pub-id>
</element-citation>
</ref>
<ref id="B28">
<label>28</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>van Dongen</surname>
<given-names>S.</given-names>
</name>
</person-group>
<source>
<italic>Graph clustering by flow simulation [Ph.D. thesis]</italic>
</source>
<year>2000</year>
<publisher-loc>Utrecht, The Netherlands</publisher-loc>
<publisher-name>Utrecht University</publisher-name>
</element-citation>
</ref>
<ref id="B29">
<label>29</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Pevzner</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Sze</surname>
<given-names>S. H.</given-names>
</name>
</person-group>
<article-title>Combinatorial approaches to finding subtle signals in DNA sequences</article-title>
<conf-name>Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology</conf-name>
<conf-date>August 2000</conf-date>
<conf-loc>Menlo Park, Calif, USA</conf-loc>
<fpage>269</fpage>
<lpage>278</lpage>
</element-citation>
</ref>
<ref id="B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Integration of external signaling pathways with the core transcriptional network in embryonic stem cells</article-title>
<source>
<italic>Cell</italic>
</source>
<year>2008</year>
<volume>133</volume>
<issue>6</issue>
<fpage>1106</fpage>
<lpage>1117</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2008.04.043</pub-id>
<pub-id pub-id-type="other">2-s2.0-44649117905</pub-id>
<pub-id pub-id-type="pmid">18555785</pub-id>
</element-citation>
</ref>
<ref id="B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tompa</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>T. L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites</article-title>
<source>
<italic>Nature Biotechnology</italic>
</source>
<year>2005</year>
<volume>23</volume>
<issue>1</issue>
<fpage>137</fpage>
<lpage>144</lpage>
<pub-id pub-id-type="doi">10.1038/nbt1053</pub-id>
<pub-id pub-id-type="other">2-s2.0-21144439147</pub-id>
</element-citation>
</ref>
<ref id="B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Crooks</surname>
<given-names>G. E.</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chandonia</surname>
<given-names>J.-M.</given-names>
</name>
<name>
<surname>Brenner</surname>
<given-names>S. E.</given-names>
</name>
</person-group>
<article-title>WebLogo: a sequence logo generator</article-title>
<source>
<italic>Genome Research</italic>
</source>
<year>2004</year>
<volume>14</volume>
<issue>6</issue>
<fpage>1188</fpage>
<lpage>1190</lpage>
<pub-id pub-id-type="doi">10.1101/gr.849004</pub-id>
<pub-id pub-id-type="other">2-s2.0-2142738304</pub-id>
<pub-id pub-id-type="pmid">15173120</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="fig1" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>An example for extracting pairs of
<italic>l</italic>
-mers.</p>
</caption>
<graphic xlink:href="BMRI2016-4986707.001"></graphic>
</fig>
<fig id="alg1" orientation="portrait" position="float">
<label>Algorithm 1</label>
<caption>
<p>PairMotifChIP.</p>
</caption>
<graphic xlink:href="BMRI2016-4986707.alg.001"></graphic>
</fig>
<table-wrap id="tab1" orientation="portrait" position="float">
<label>Table 1</label>
<caption>
<p>Data for probabilistic analysis.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>d</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>k</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>E</italic>
<sub>
<italic>k</italic>
</sub>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>N</italic>
<sub>1</sub>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>N</italic>
<sub>2</sub>
</th>
<th align="center" rowspan="1" colspan="1">occ
<sub>
<italic>r</italic>
</sub>
(
<italic>x</italic>
)</th>
<th align="center" rowspan="1" colspan="1">occ
<sub>
<italic>m</italic>
</sub>
(
<italic>x</italic>
)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">8</td>
<td align="center" rowspan="1" colspan="1">1</td>
<td align="center" rowspan="1" colspan="1">0</td>
<td align="center" rowspan="1" colspan="1">0.57</td>
<td align="center" rowspan="1" colspan="1">284715</td>
<td align="center" rowspan="1" colspan="1">12784</td>
<td align="center" rowspan="1" colspan="1">2.95</td>
<td align="center" rowspan="1" colspan="1">32.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">2</td>
<td align="center" rowspan="1" colspan="1">0</td>
<td align="center" rowspan="1" colspan="1">0.14</td>
<td align="center" rowspan="1" colspan="1">69930</td>
<td align="center" rowspan="1" colspan="1">511</td>
<td align="center" rowspan="1" colspan="1">0.73</td>
<td align="center" rowspan="1" colspan="1">1.28</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">10</td>
<td align="center" rowspan="1" colspan="1">2</td>
<td align="center" rowspan="1" colspan="1">0</td>
<td align="center" rowspan="1" colspan="1">0.04</td>
<td align="center" rowspan="1" colspan="1">19980</td>
<td align="center" rowspan="1" colspan="1">511</td>
<td align="center" rowspan="1" colspan="1">0.19</td>
<td align="center" rowspan="1" colspan="1">1.28</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">11</td>
<td align="center" rowspan="1" colspan="1">3</td>
<td align="center" rowspan="1" colspan="1">1</td>
<td align="center" rowspan="1" colspan="1">0.29</td>
<td align="center" rowspan="1" colspan="1">144855</td>
<td align="center" rowspan="1" colspan="1">3577</td>
<td align="center" rowspan="1" colspan="1">1.54</td>
<td align="center" rowspan="1" colspan="1">8.95</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="center" rowspan="1" colspan="1">3</td>
<td align="center" rowspan="1" colspan="1">1</td>
<td align="center" rowspan="1" colspan="1">0.08</td>
<td align="center" rowspan="1" colspan="1">39960</td>
<td align="center" rowspan="1" colspan="1">3196</td>
<td align="center" rowspan="1" colspan="1">0.42</td>
<td align="center" rowspan="1" colspan="1">8.00</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">13</td>
<td align="center" rowspan="1" colspan="1">4</td>
<td align="center" rowspan="1" colspan="1">2</td>
<td align="center" rowspan="1" colspan="1">0.39</td>
<td align="center" rowspan="1" colspan="1">194805</td>
<td align="center" rowspan="1" colspan="1">4581</td>
<td align="center" rowspan="1" colspan="1">2.08</td>
<td align="center" rowspan="1" colspan="1">11.46</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">14</td>
<td align="center" rowspan="1" colspan="1">4</td>
<td align="center" rowspan="1" colspan="1">2</td>
<td align="center" rowspan="1" colspan="1">0.11</td>
<td align="center" rowspan="1" colspan="1">54945</td>
<td align="center" rowspan="1" colspan="1">4059</td>
<td align="center" rowspan="1" colspan="1">0.60</td>
<td align="center" rowspan="1" colspan="1">10.16</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">15</td>
<td align="center" rowspan="1" colspan="1">5</td>
<td align="center" rowspan="1" colspan="1">3</td>
<td align="center" rowspan="1" colspan="1">0.43</td>
<td align="center" rowspan="1" colspan="1">214785</td>
<td align="center" rowspan="1" colspan="1">5274</td>
<td align="center" rowspan="1" colspan="1">2.30</td>
<td align="center" rowspan="1" colspan="1">13.20</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">16</td>
<td align="center" rowspan="1" colspan="1">5</td>
<td align="center" rowspan="1" colspan="1">3</td>
<td align="center" rowspan="1" colspan="1">0.13</td>
<td align="center" rowspan="1" colspan="1">64935</td>
<td align="center" rowspan="1" colspan="1">4584</td>
<td align="center" rowspan="1" colspan="1">0.70</td>
<td align="center" rowspan="1" colspan="1">11.47</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab2" orientation="portrait" position="float">
<label>Table 2</label>
<caption>
<p>Running time on the first group of simulated data sets.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>g</italic>
</th>
<th align="center" rowspan="1" colspan="1">PairMotifChIP</th>
<th align="center" rowspan="1" colspan="1">MEME-ChIP</th>
<th align="center" rowspan="1" colspan="1">F-motif</th>
<th align="center" rowspan="1" colspan="1">PairMotif+</th>
<th align="center" rowspan="1" colspan="1">qPMS9</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="left" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">26.3 s</td>
<td align="center" rowspan="1" colspan="1">1510.1 s</td>
<td align="center" rowspan="1" colspan="1">9.2 s</td>
<td align="center" rowspan="1" colspan="1">300.7 s</td>
<td align="center" rowspan="1" colspan="1">247.4 s</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">21.3 s </td>
<td align="center" rowspan="1" colspan="1">1507.1 s</td>
<td align="center" rowspan="1" colspan="1">9.2 s</td>
<td align="center" rowspan="1" colspan="1">212.9 s</td>
<td align="center" rowspan="1" colspan="1">234.7 s </td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">18.7 s </td>
<td align="center" rowspan="1" colspan="1">1462.6 s</td>
<td align="center" rowspan="1" colspan="1">9.1 s</td>
<td align="center" rowspan="1" colspan="1">217.9 s</td>
<td align="center" rowspan="1" colspan="1">226.0 s</td>
</tr>
<tr>
<td align="center" colspan="7" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">15</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">35.9 s</td>
<td align="center" rowspan="1" colspan="1">1325.0 s</td>
<td align="center" rowspan="1" colspan="1">16655.1 s</td>
<td align="center" rowspan="1" colspan="1">73048.5 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">25.6 s</td>
<td align="center" rowspan="1" colspan="1">1354.9 s</td>
<td align="center" rowspan="1" colspan="1">16403.4 s</td>
<td align="center" rowspan="1" colspan="1">23549.0 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">19.5 s</td>
<td align="center" rowspan="1" colspan="1">1466.7 s</td>
<td align="center" rowspan="1" colspan="1">15982.7 s</td>
<td align="center" rowspan="1" colspan="1">845.6 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" colspan="7" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">47.4 s</td>
<td align="center" rowspan="1" colspan="1">1425.5 s</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">30.7 s</td>
<td align="center" rowspan="1" colspan="1">1148.5 s</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">20.5 s</td>
<td align="center" rowspan="1" colspan="1">1349.2 s</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note</italic>
. s: seconds; —: over 24 hours.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab3" orientation="portrait" position="float">
<label>Table 3</label>
<caption>
<p>Site-level identification accuracy on the first group of simulated data sets.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="center" rowspan="1" colspan="1">
<italic>g</italic>
</th>
<th align="center" rowspan="1" colspan="1">PairMotifChIP</th>
<th align="center" rowspan="1" colspan="1">MEME-ChIP</th>
<th align="center" rowspan="1" colspan="1">F-motif</th>
<th align="center" rowspan="1" colspan="1">PairMotif+</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="left" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">0.942</td>
<td align="center" rowspan="1" colspan="1">0.866</td>
<td align="center" rowspan="1" colspan="1">0.942</td>
<td align="center" rowspan="1" colspan="1">0.942</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.902</td>
<td align="center" rowspan="1" colspan="1">0.734</td>
<td align="center" rowspan="1" colspan="1">0.902</td>
<td align="center" rowspan="1" colspan="1">0.902</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.907</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">0.907</td>
<td align="center" rowspan="1" colspan="1">0.907</td>
</tr>
<tr>
<td align="center" colspan="6" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">15</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">0.995</td>
<td align="center" rowspan="1" colspan="1">0.960</td>
<td align="center" rowspan="1" colspan="1">0.995</td>
<td align="center" rowspan="1" colspan="1">0.995</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">0.916</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.936</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">0.936</td>
<td align="center" rowspan="1" colspan="1">0.936</td>
</tr>
<tr>
<td align="center" colspan="6" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">0.947</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.988</td>
<td align="center" rowspan="1" colspan="1">0.953</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.981</td>
<td align="center" rowspan="1" colspan="1">0.844</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note</italic>
. —: the result is not obtained because the running time is over 24 hours;
<sup>
<italic></italic>
</sup>
the result is not obtained because motif sites are not provided by MEME-ChIP on the corresponding data sets. The site-level identification accuracy is evaluated by the site-level performance coefficient sPC. Since qPMS9 and F-motif report the same motifs and have the same identification accuracy, the results of qPMS9 are not listed in this table. </p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab4" orientation="portrait" position="float">
<label>Table 4</label>
<caption>
<p>Nucleotide-level identification accuracy on the first group of simulated data sets.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2" align="left" colspan="1">
<italic>l</italic>
</th>
<th rowspan="2" align="center" colspan="1">
<italic>g</italic>
</th>
<th colspan="3" align="center" rowspan="1">PairMotifChIP</th>
<th colspan="3" align="center" rowspan="1">MEME-ChIP</th>
<th colspan="3" align="center" rowspan="1">F-motif</th>
<th colspan="3" align="center" rowspan="1">PairMotif+</th>
</tr>
<tr>
<th align="center" rowspan="1" colspan="1">nCC</th>
<th align="center" rowspan="1" colspan="1">nSn</th>
<th align="center" rowspan="1" colspan="1">nSp</th>
<th align="center" rowspan="1" colspan="1">nCC</th>
<th align="center" rowspan="1" colspan="1">nSn</th>
<th align="center" rowspan="1" colspan="1">nSp</th>
<th align="center" rowspan="1" colspan="1">nCC</th>
<th align="center" rowspan="1" colspan="1">nSn</th>
<th align="center" rowspan="1" colspan="1">nSp</th>
<th align="center" rowspan="1" colspan="1">nCC</th>
<th align="center" rowspan="1" colspan="1">nSn</th>
<th align="center" rowspan="1" colspan="1">nSp</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" align="left" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">0.970</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.927</td>
<td align="center" rowspan="1" colspan="1">0.899</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">0.970</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">0.970</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.947</td>
<td align="center" rowspan="1" colspan="1">0.949</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.849</td>
<td align="center" rowspan="1" colspan="1">0.762</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.947</td>
<td align="center" rowspan="1" colspan="1">0.949</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.947</td>
<td align="center" rowspan="1" colspan="1">0.949</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.921</td>
<td align="center" rowspan="1" colspan="1">0.929</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">0.950</td>
<td align="center" rowspan="1" colspan="1">0.951</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.950</td>
<td align="center" rowspan="1" colspan="1">0.951</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
</tr>
<tr>
<td align="center" colspan="14" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">15</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">0.978</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.983</td>
<td align="center" rowspan="1" colspan="1">0.984</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.952</td>
<td align="center" rowspan="1" colspan="1">0.950</td>
<td align="center" rowspan="1" colspan="1">0.997</td>
<td align="center" rowspan="1" colspan="1">0.983</td>
<td align="center" rowspan="1" colspan="1">0.984</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.983</td>
<td align="center" rowspan="1" colspan="1">0.984</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.965</td>
<td align="center" rowspan="1" colspan="1">0.967</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">
<italic></italic>
</td>
<td align="center" rowspan="1" colspan="1">0.965</td>
<td align="center" rowspan="1" colspan="1">0.967</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
<td align="center" rowspan="1" colspan="1">0.965</td>
<td align="center" rowspan="1" colspan="1">0.967</td>
<td align="center" rowspan="1" colspan="1">0.998</td>
</tr>
<tr>
<td align="center" colspan="14" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="3" align="left" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">0.2</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">0.969</td>
<td align="center" rowspan="1" colspan="1">1.000</td>
<td align="center" rowspan="1" colspan="1">0.994</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.5</td>
<td align="center" rowspan="1" colspan="1">0.993</td>
<td align="center" rowspan="1" colspan="1">0.994</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.972</td>
<td align="center" rowspan="1" colspan="1">0.955</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">0.8</td>
<td align="center" rowspan="1" colspan="1">0.989</td>
<td align="center" rowspan="1" colspan="1">0.990</td>
<td align="center" rowspan="1" colspan="1">0.999</td>
<td align="center" rowspan="1" colspan="1">0.921</td>
<td align="center" rowspan="1" colspan="1">0.906</td>
<td align="center" rowspan="1" colspan="1">0.996</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note</italic>
. —: the result is not obtained because the running time is over 24 hours;
<sup>
<italic></italic>
</sup>
the result is not obtained because motif sites are not provided by MEME-ChIP on the corresponding data sets. Since qPMS9 and F-motif report the same motifs and have the same identification accuracy, the results of qPMS9 are not listed in this table. Besides the nucleotide-level identification accuracy nCC, the sensitivity nSn and specificity nSp are also listed in this table.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab5" orientation="portrait" position="float">
<label>Table 5</label>
<caption>
<p>Running time and identification accuracy on the second group of simulated data sets.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">
<italic>l</italic>
</th>
<th align="center" rowspan="1" colspan="1">Sequence #</th>
<th align="center" rowspan="1" colspan="1">PairMotifChIP</th>
<th align="center" rowspan="1" colspan="1">F-motif</th>
<th align="center" rowspan="1" colspan="1">PairMotif+</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="6" align="left" colspan="1">9</td>
<td align="center" rowspan="1" colspan="1">500</td>
<td align="center" rowspan="1" colspan="1">14.4 s (0.955)</td>
<td align="center" rowspan="1" colspan="1">7.8 s (0.955)</td>
<td align="center" rowspan="1" colspan="1">68.1 s (0.628)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1000</td>
<td align="center" rowspan="1" colspan="1">60.6 s (0.945)</td>
<td align="center" rowspan="1" colspan="1">17.2 s (0.945)</td>
<td align="center" rowspan="1" colspan="1">410.1 s (0.945)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1500</td>
<td align="center" rowspan="1" colspan="1">133.3 s (0.953)</td>
<td align="center" rowspan="1" colspan="1">27.8 s (0.953)</td>
<td align="center" rowspan="1" colspan="1">989.9 s (0.953)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2000</td>
<td align="center" rowspan="1" colspan="1">231.4 s (0.953)</td>
<td align="center" rowspan="1" colspan="1">40.0 s (0.953)</td>
<td align="center" rowspan="1" colspan="1">1704.1 s (0.953)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2500</td>
<td align="center" rowspan="1" colspan="1">361.4 s (0.951)</td>
<td align="center" rowspan="1" colspan="1">52.8 s (0.951)</td>
<td align="center" rowspan="1" colspan="1">3012.7 s (0.951)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">3000</td>
<td align="center" rowspan="1" colspan="1">519.2 s (0.955)</td>
<td align="center" rowspan="1" colspan="1">67.2 s (0.955)</td>
<td align="center" rowspan="1" colspan="1">4307.4 s (0.955)</td>
</tr>
<tr>
<td align="center" colspan="5" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="6" align="left" colspan="1">15</td>
<td align="center" rowspan="1" colspan="1">500</td>
<td align="center" rowspan="1" colspan="1">17.9 s (0.986)</td>
<td align="center" rowspan="1" colspan="1">13581.7 s (0.986)</td>
<td align="center" rowspan="1" colspan="1">14394.4 s (0.986)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1000</td>
<td align="center" rowspan="1" colspan="1">74.8 s (0.983)</td>
<td align="center" rowspan="1" colspan="1">30293.2 s (0.983)</td>
<td align="center" rowspan="1" colspan="1">35172.2 s (0.983)</td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1500</td>
<td align="center" rowspan="1" colspan="1">150.9 s (0.980)</td>
<td align="center" rowspan="1" colspan="1">50102.5 s (0.980)</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2000</td>
<td align="center" rowspan="1" colspan="1">253.0 s (0.981)</td>
<td align="center" rowspan="1" colspan="1">66344.7 s (0.981)</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2500</td>
<td align="center" rowspan="1" colspan="1">396.9 s (0.982)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">3000</td>
<td align="center" rowspan="1" colspan="1">554.4 s (0.981)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" colspan="5" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="6" align="left" colspan="1">21</td>
<td align="center" rowspan="1" colspan="1">500</td>
<td align="center" rowspan="1" colspan="1">22.9 s (0.995)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1000</td>
<td align="center" rowspan="1" colspan="1">90.5 s (0.996)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">1500</td>
<td align="center" rowspan="1" colspan="1">171.6 s (0.995)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2000</td>
<td align="center" rowspan="1" colspan="1">277.2 s (0.995)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">2500</td>
<td align="center" rowspan="1" colspan="1">423.8 s (0.996)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="center" rowspan="1" colspan="1">3000</td>
<td align="center" rowspan="1" colspan="1">592.2 s (0.996)</td>
<td align="center" rowspan="1" colspan="1"></td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note.</italic>
s: seconds; —: over 24 hours. The number after each running time is the corresponding nucleotide-level identification accuracy nCC.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab6" orientation="portrait" position="float">
<label>Table 6</label>
<caption>
<p>Running time of methods for extracting pairs of
<italic>l</italic>
-mers.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Sequence
<italic> #</italic>
</th>
<th align="center" rowspan="1" colspan="1">Method in this paper</th>
<th align="center" rowspan="1" colspan="1">Method in [
<xref rid="B26" ref-type="bibr">26</xref>
]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">200</td>
<td align="center" rowspan="1" colspan="1">2.2 s</td>
<td align="center" rowspan="1" colspan="1">23.6 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">400</td>
<td align="center" rowspan="1" colspan="1">8.7 s</td>
<td align="center" rowspan="1" colspan="1">96.1 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">600</td>
<td align="center" rowspan="1" colspan="1">19.7 s</td>
<td align="center" rowspan="1" colspan="1">197.9 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">800</td>
<td align="center" rowspan="1" colspan="1">34.7 s</td>
<td align="center" rowspan="1" colspan="1">331.7 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1000</td>
<td align="center" rowspan="1" colspan="1">54.3 s</td>
<td align="center" rowspan="1" colspan="1">518.1 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1200</td>
<td align="center" rowspan="1" colspan="1">78.4 s</td>
<td align="center" rowspan="1" colspan="1">741.6 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1400</td>
<td align="center" rowspan="1" colspan="1">109.0 s</td>
<td align="center" rowspan="1" colspan="1">1015.4 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1600</td>
<td align="center" rowspan="1" colspan="1">140.5 s</td>
<td align="center" rowspan="1" colspan="1">1334.1 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1800</td>
<td align="center" rowspan="1" colspan="1">178.3 s</td>
<td align="center" rowspan="1" colspan="1">1731.2 s</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">2000</td>
<td align="center" rowspan="1" colspan="1">223.2 s</td>
<td align="center" rowspan="1" colspan="1">2163.5 s</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note</italic>
. s: seconds.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tab7" orientation="portrait" position="float">
<label>Table 7</label>
<caption>
<p>Results on the mESC data.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2" align="left" colspan="1">Data set</th>
<th rowspan="2" align="center" colspan="1">Published motif</th>
<th colspan="2" align="center" rowspan="1">PairMotifChIP</th>
<th colspan="2" align="center" rowspan="1">PairMotif+</th>
</tr>
<tr>
<th align="center" rowspan="1" colspan="1">Time</th>
<th align="center" rowspan="1" colspan="1">Predicted motif</th>
<th align="center" rowspan="1" colspan="1">Time</th>
<th align="center" rowspan="1" colspan="1">Predicted motif</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">c-Myc</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i001.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">37.2 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i002.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">4106.1 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i003.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CTCF</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i004.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">29.1 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i005.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">23584.3 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i006.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Esrrb</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i007.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">25.6 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i008.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">7424.6 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i009.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Klf4</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i010.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">29.3 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i011.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">3558.5 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i012.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Nanog</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i013.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">24.3 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i014.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">1975.6 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i015.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">n-Myc</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i016.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">36.3 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i017.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">33962.6 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Oct4</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i018.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">8.9 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i019.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">2608.8 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i020.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Smad1</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i021.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">20.3 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i022.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">5296.1 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Sox2</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i023.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">23.1 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i024.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">4115.2 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i025.jpg"></inline-graphic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">STAT3</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i026.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">22.9 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i027.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">6342.6 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Tcfcp2I1</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i028.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">23.5 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i029.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">2269.5 s</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Zfx</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i030.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">42.2 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i031.jpg"></inline-graphic>
</td>
<td align="center" rowspan="1" colspan="1">3617.2 s</td>
<td align="center" rowspan="1" colspan="1">
<inline-graphic xlink:href="BMRI2016-4986707.tab7.i032.jpg"></inline-graphic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>Note.</italic>
—: there is no motif overlapping the published motif in the top ten predicted motifs.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B32 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B32 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5098105
   |texte=   PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27843946" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021