MERS Exploration Server

Internal identifier: 0002728 (Pmc/Corpus); previous: 0002727; next: 0002729

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform</title>
<author>
<name sortKey="Lin, Jie" sort="Lin, Jie" uniqKey="Lin J" first="Jie" last="Lin">Jie Lin</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wei, Jing" sort="Wei, Jing" uniqKey="Wei J" first="Jing" last="Wei">Jing Wei</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Adjeroh, Donald" sort="Adjeroh, Donald" uniqKey="Adjeroh D" first="Donald" last="Adjeroh">Donald Adjeroh</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2156 6140</institution-id>
<institution-id institution-id-type="GRID">grid.268154.c</institution-id>
<institution>Lane Department of Computer Science and Electrical Engineering, West Virginia University,</institution>
</institution-wrap>
Morgantown, 26506 WV USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jiang, Bing Hua" sort="Jiang, Bing Hua" uniqKey="Jiang B" first="Bing-Hua" last="Jiang">Bing-Hua Jiang</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 8294</institution-id>
<institution-id institution-id-type="GRID">grid.214572.7</institution-id>
<institution>Department of Pathology, University of Iowa,</institution>
</institution-wrap>
Iowa City, 52242 Iowa USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jiang, Yue" sort="Jiang, Yue" uniqKey="Jiang Y" first="Yue" last="Jiang">Yue Jiang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">29720081</idno>
<idno type="pmc">5930706</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930706</idno>
<idno type="RBID">PMC:5930706</idno>
<idno type="doi">10.1186/s12859-018-2155-9</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000272</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000272</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform</title>
<author>
<name sortKey="Lin, Jie" sort="Lin, Jie" uniqKey="Lin J" first="Jie" last="Lin">Jie Lin</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Wei, Jing" sort="Wei, Jing" uniqKey="Wei J" first="Jing" last="Wei">Jing Wei</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Adjeroh, Donald" sort="Adjeroh, Donald" uniqKey="Adjeroh D" first="Donald" last="Adjeroh">Donald Adjeroh</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2156 6140</institution-id>
<institution-id institution-id-type="GRID">grid.268154.c</institution-id>
<institution>Lane Department of Computer Science and Electrical Engineering, West Virginia University,</institution>
</institution-wrap>
Morgantown, 26506 WV USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jiang, Bing Hua" sort="Jiang, Bing Hua" uniqKey="Jiang B" first="Bing-Hua" last="Jiang">Bing-Hua Jiang</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 8294</institution-id>
<institution-id institution-id-type="GRID">grid.214572.7</institution-id>
<institution>Department of Pathology, University of Iowa,</institution>
</institution-wrap>
Iowa City, 52242 Iowa USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jiang, Yue" sort="Jiang, Yue" uniqKey="Jiang Y" first="Yue" last="Jiang">Yue Jiang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts.</p>
</sec>
<sec>
<title>Results</title>
<p>A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts
<italic>k</italic>
-mers from a sequence, then maps each
<italic>k</italic>
-mer to a complex number field. The resulting series of complex numbers is then transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Using two different types of applications, namely clustering and classification, we compared SSAW against state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These results make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zielezinski, A" uniqKey="Zielezinski A">A Zielezinski</name>
</author>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
<author>
<name sortKey="Karlowski, Wm" uniqKey="Karlowski W">WM Karlowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pratas, D" uniqKey="Pratas D">D Pratas</name>
</author>
<author>
<name sortKey="Silva, R M" uniqKey="Silva R">R. M Silva</name>
</author>
<author>
<name sortKey="Pinho, A J" uniqKey="Pinho A">A. J Pinho</name>
</author>
<author>
<name sortKey="Ferreira, Pjsg" uniqKey="Ferreira P">PJSG Ferreira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guillaume, H" uniqKey="Guillaume H">H Guillaume</name>
</author>
<author>
<name sortKey="Roland, W" uniqKey="Roland W">W Roland</name>
</author>
<author>
<name sortKey="Jens, S" uniqKey="Jens S">S Jens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pizzi, C" uniqKey="Pizzi C">C Pizzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thankachan, Sv" uniqKey="Thankachan S">SV Thankachan</name>
</author>
<author>
<name sortKey="Chockalingam, Sp" uniqKey="Chockalingam S">SP Chockalingam</name>
</author>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Krishnan, A" uniqKey="Krishnan A">A Krishnan</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="He, L" uniqKey="He L">L He</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Rong, Lh" uniqKey="Rong L">LH Rong</name>
</author>
<author>
<name sortKey="Yau, St" uniqKey="Yau S">ST Yau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tripathi, P" uniqKey="Tripathi P">P Tripathi</name>
</author>
<author>
<name sortKey="Pandey, P N" uniqKey="Pandey P">P. N Pandey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pajuste, Fd" uniqKey="Pajuste F">FD Pajuste</name>
</author>
<author>
<name sortKey="Kaplinski, L" uniqKey="Kaplinski L">L Kaplinski</name>
</author>
<author>
<name sortKey="Mols, M" uniqKey="Mols M">M Mols</name>
</author>
<author>
<name sortKey="Puurand, T" uniqKey="Puurand T">T Puurand</name>
</author>
<author>
<name sortKey="Lepamets, M" uniqKey="Lepamets M">M Lepamets</name>
</author>
<author>
<name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rudewicz, J" uniqKey="Rudewicz J">J Rudewicz</name>
</author>
<author>
<name sortKey="Soueidan, H" uniqKey="Soueidan H">H Soueidan</name>
</author>
<author>
<name sortKey="Uricaru, R" uniqKey="Uricaru R">R Uricaru</name>
</author>
<author>
<name sortKey="Bonnefoi, H" uniqKey="Bonnefoi H">H Bonnefoi</name>
</author>
<author>
<name sortKey="Iggo, R" uniqKey="Iggo R">R Iggo</name>
</author>
<author>
<name sortKey="Bergh, J" uniqKey="Bergh J">J Bergh</name>
</author>
<author>
<name sortKey="Nikolski, M" uniqKey="Nikolski M">M Nikolski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cong, Y" uniqKey="Cong Y">Y Cong</name>
</author>
<author>
<name sortKey="Chan, Yb" uniqKey="Chan Y">YB Chan</name>
</author>
<author>
<name sortKey="Ragan, Ma" uniqKey="Ragan M">MA Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bromberg, R" uniqKey="Bromberg R">R Bromberg</name>
</author>
<author>
<name sortKey="Grishin, N V" uniqKey="Grishin N">N. V Grishin</name>
</author>
<author>
<name sortKey="Otwinowski, Z" uniqKey="Otwinowski Z">Z Otwinowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brittnacher, Mj" uniqKey="Brittnacher M">MJ Brittnacher</name>
</author>
<author>
<name sortKey="Heltshe, Sl" uniqKey="Heltshe S">SL Heltshe</name>
</author>
<author>
<name sortKey="Hayden, Hs" uniqKey="Hayden H">HS Hayden</name>
</author>
<author>
<name sortKey="Radey, Mc" uniqKey="Radey M">MC Radey</name>
</author>
<author>
<name sortKey="Weiss, Ej" uniqKey="Weiss E">EJ Weiss</name>
</author>
<author>
<name sortKey="Damman, Cj" uniqKey="Damman C">CJ Damman</name>
</author>
<author>
<name sortKey="Zisman, Tl" uniqKey="Zisman T">TL Zisman</name>
</author>
<author>
<name sortKey="Suskind, Dl" uniqKey="Suskind D">DL Suskind</name>
</author>
<author>
<name sortKey="Miller, Si" uniqKey="Miller S">SI Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pham, Dt" uniqKey="Pham D">DT Pham</name>
</author>
<author>
<name sortKey="Gao, S" uniqKey="Gao S">S Gao</name>
</author>
<author>
<name sortKey="Phan, V" uniqKey="Phan V">V Phan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yaveroglu, O N" uniqKey="Yaveroglu O">O. N Yaveroglu</name>
</author>
<author>
<name sortKey="Milenkovic, T" uniqKey="Milenkovic T">T Milenkovic</name>
</author>
<author>
<name sortKey="Przulj, N" uniqKey="Przulj N">N Przulj</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qian, Z" uniqKey="Qian Z">Z Qian</name>
</author>
<author>
<name sortKey="Jun, S R" uniqKey="Jun S">S. R Jun</name>
</author>
<author>
<name sortKey="Leuze, M" uniqKey="Leuze M">M Leuze</name>
</author>
<author>
<name sortKey="Ussery, D" uniqKey="Ussery D">D Ussery</name>
</author>
<author>
<name sortKey="Nookaew, I" uniqKey="Nookaew I">I Nookaew</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="He, L" uniqKey="He L">L He</name>
</author>
<author>
<name sortKey="He, Rl" uniqKey="He R">RL He</name>
</author>
<author>
<name sortKey="Yau, Ss" uniqKey="Yau S">SS Yau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Golia, B" uniqKey="Golia B">B Golia</name>
</author>
<author>
<name sortKey="Moeller, Gk" uniqKey="Moeller G">GK Moeller</name>
</author>
<author>
<name sortKey="Jankevicius, G" uniqKey="Jankevicius G">G Jankevicius</name>
</author>
<author>
<name sortKey="Schmidt, A" uniqKey="Schmidt A">A Schmidt</name>
</author>
<author>
<name sortKey="Hegele, A" uniqKey="Hegele A">A Hegele</name>
</author>
<author>
<name sortKey="Preiber, J" uniqKey="Preiber J">J Preißer</name>
</author>
<author>
<name sortKey="Mai, Lt" uniqKey="Mai L">LT Mai</name>
</author>
<author>
<name sortKey="Imhof, A" uniqKey="Imhof A">A Imhof</name>
</author>
<author>
<name sortKey="Timinszky, G" uniqKey="Timinszky G">G Timinszky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Madsen, Mh" uniqKey="Madsen M">MH Madsen</name>
</author>
<author>
<name sortKey="Boher, P" uniqKey="Boher P">P Boher</name>
</author>
<author>
<name sortKey="Hansen, Pe" uniqKey="Hansen P">PE Hansen</name>
</author>
<author>
<name sortKey="J Rgensen, Jf" uniqKey="J Rgensen J">JF Jørgensen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bonhamcarter, O" uniqKey="Bonhamcarter O">O Bonham-Carter</name>
</author>
<author>
<name sortKey="Steele, J" uniqKey="Steele J">J Steele</name>
</author>
<author>
<name sortKey="Bastola, D" uniqKey="Bastola D">D Bastola</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Badger, J" uniqKey="Badger J">J Badger</name>
</author>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
<author>
<name sortKey="Kwong, S" uniqKey="Kwong S">S Kwong</name>
</author>
<author>
<name sortKey="Kearney, P" uniqKey="Kearney P">P Kearney</name>
</author>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dai, Q" uniqKey="Dai Q">Q Dai</name>
</author>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Yao, Y" uniqKey="Yao Y">Y Yao</name>
</author>
<author>
<name sortKey="Zhao, F" uniqKey="Zhao F">F Zhao</name>
</author>
<author>
<name sortKey="Zhang, M" uniqKey="Zhang M">M Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Schuster, Sm" uniqKey="Schuster S">SM Schuster</name>
</author>
<author>
<name sortKey="Sayood, K" uniqKey="Sayood K">K Sayood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blaisdell, Be" uniqKey="Blaisdell B">BE Blaisdell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dan, W" uniqKey="Dan W">W Dan</name>
</author>
<author>
<name sortKey="Jiang, Q" uniqKey="Jiang Q">Q Jiang</name>
</author>
<author>
<name sortKey="Wei, Y" uniqKey="Wei Y">Y Wei</name>
</author>
<author>
<name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Hao, B I" uniqKey="Hao B">B. I Hao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pham, T D" uniqKey="Pham T">T. D Pham</name>
</author>
<author>
<name sortKey="Zuegg, J" uniqKey="Zuegg J">J Zuegg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Tj" uniqKey="Wu T">TJ Wu</name>
</author>
<author>
<name sortKey="Burke, Jp" uniqKey="Burke J">JP Burke</name>
</author>
<author>
<name sortKey="Davison, Db" uniqKey="Davison D">DB Davison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Tj" uniqKey="Wu T">TJ Wu</name>
</author>
<author>
<name sortKey="Hsieh, Yc" uniqKey="Hsieh Y">YC Hsieh</name>
</author>
<author>
<name sortKey="Li, La" uniqKey="Li L">LA Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bai, F" uniqKey="Bai F">F Bai</name>
</author>
<author>
<name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leimeister, Ca" uniqKey="Leimeister C">CA Leimeister</name>
</author>
<author>
<name sortKey="Boden, M" uniqKey="Boden M">M Boden</name>
</author>
<author>
<name sortKey="Horwege, S" uniqKey="Horwege S">S Horwege</name>
</author>
<author>
<name sortKey="Lindner, S" uniqKey="Lindner S">S Lindner</name>
</author>
<author>
<name sortKey="Morgenstern, B" uniqKey="Morgenstern B">B Morgenstern</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Schimd, M" uniqKey="Schimd M">M Schimd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schwende, I" uniqKey="Schwende I">I Schwende</name>
</author>
<author>
<name sortKey="Pham, Td" uniqKey="Pham T">TD Pham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bao, Jp" uniqKey="Bao J">JP Bao</name>
</author>
<author>
<name sortKey="Yuan, Ry" uniqKey="Yuan R">RY Yuan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pyatkov, Mi" uniqKey="Pyatkov M">MI Pyatkov</name>
</author>
<author>
<name sortKey="Pankratov, An" uniqKey="Pankratov A">AN Pankratov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cheever, Ea" uniqKey="Cheever E">EA Cheever</name>
</author>
<author>
<name sortKey="Overton, Gc" uniqKey="Overton G">GC Overton</name>
</author>
<author>
<name sortKey="Searls, Db" uniqKey="Searls D">DB Searls</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pal, J" uniqKey="Pal J">J Pal</name>
</author>
<author>
<name sortKey="Ghosh, S" uniqKey="Ghosh S">S Ghosh</name>
</author>
<author>
<name sortKey="Maji, B" uniqKey="Maji B">B Maji</name>
</author>
<author>
<name sortKey="Bhattacharya, Dk" uniqKey="Bhattacharya D">DK Bhattacharya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grabherr, Mg" uniqKey="Grabherr M">MG Grabherr</name>
</author>
<author>
<name sortKey="Russell, P" uniqKey="Russell P">P Russell</name>
</author>
<author>
<name sortKey="Meyer, M" uniqKey="Meyer M">M Meyer</name>
</author>
<author>
<name sortKey="Mauceli, E" uniqKey="Mauceli E">E Mauceli</name>
</author>
<author>
<name sortKey="Alfoldi, J" uniqKey="Alfoldi J">J Alföldi</name>
</author>
<author>
<name sortKey="Di, Pf" uniqKey="Di P">PF Di</name>
</author>
<author>
<name sortKey="Lindblad Toh, K" uniqKey="Lindblad Toh K">K Lindblad-Toh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaovalit, P" uniqKey="Chaovalit P">P Chaovalit</name>
</author>
<author>
<name sortKey="Gangopadhyay, A" uniqKey="Gangopadhyay A">A Gangopadhyay</name>
</author>
<author>
<name sortKey="Karabatis, G" uniqKey="Karabatis G">G Karabatis</name>
</author>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsonis, Aa" uniqKey="Tsonis A">AA Tsonis</name>
</author>
<author>
<name sortKey="Kumar, P" uniqKey="Kumar P">P Kumar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haimovich, Ad" uniqKey="Haimovich A">AD Haimovich</name>
</author>
<author>
<name sortKey="Byrne, B" uniqKey="Byrne B">B Byrne</name>
</author>
<author>
<name sortKey="Ramaswamy, R" uniqKey="Ramaswamy R">R Ramaswamy</name>
</author>
<author>
<name sortKey="Welsh, Wj" uniqKey="Welsh W">WJ Welsh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nanni, L" uniqKey="Nanni L">L Nanni</name>
</author>
<author>
<name sortKey="Brahnam, S" uniqKey="Brahnam S">S Brahnam</name>
</author>
<author>
<name sortKey="Lumini, A" uniqKey="Lumini A">A Lumini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abbasi, O" uniqKey="Abbasi O">O Abbasi</name>
</author>
<author>
<name sortKey="Rostami, A" uniqKey="Rostami A">A Rostami</name>
</author>
<author>
<name sortKey="Karimian, G" uniqKey="Karimian G">G Karimian</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Athanasiadis, Ei" uniqKey="Athanasiadis E">EI Athanasiadis</name>
</author>
<author>
<name sortKey="Cavouras, Da" uniqKey="Cavouras D">DA Cavouras</name>
</author>
<author>
<name sortKey="Glotsos, Dt" uniqKey="Glotsos D">DT Glotsos</name>
</author>
<author>
<name sortKey="Georgiadis, Pv" uniqKey="Georgiadis P">PV Georgiadis</name>
</author>
<author>
<name sortKey="Kalatzis, Ik" uniqKey="Kalatzis I">IK Kalatzis</name>
</author>
<author>
<name sortKey="Nikiforidis, Gc" uniqKey="Nikiforidis G">GC Nikiforidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Liu, P" uniqKey="Liu P">P Liu</name>
</author>
<author>
<name sortKey="Yin, G" uniqKey="Yin G">G Yin</name>
</author>
<author>
<name sortKey="Jiang, H" uniqKey="Jiang H">H Jiang</name>
</author>
<author>
<name sortKey="Li, X" uniqKey="Li X">X Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lonard, M" uniqKey="Lonard M">M Léonard</name>
</author>
<author>
<name sortKey="Mouchard, L" uniqKey="Mouchard L">L Mouchard</name>
</author>
<author>
<name sortKey="Salson, M" uniqKey="Salson M">M Salson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fowler, J E" uniqKey="Fowler J">J. E Fowler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Chockalingam, Sp" uniqKey="Chockalingam S">SP Chockalingam</name>
</author>
<author>
<name sortKey="Aluru, S" uniqKey="Aluru S">S Aluru</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Needleman, S B" uniqKey="Needleman S">S. B Needleman</name>
</author>
<author>
<name sortKey="Wunsch, C D" uniqKey="Wunsch C">C. D Wunsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wagner, R A" uniqKey="Wagner R">R. A Wagner</name>
</author>
<author>
<name sortKey="Fischer, M J" uniqKey="Fischer M">M. J Fischer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">29720081</article-id>
<article-id pub-id-type="pmc">5930706</article-id>
<article-id pub-id-type="publisher-id">2155</article-id>
<article-id pub-id-type="doi">10.1186/s12859-018-2155-9</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Lin</surname>
<given-names>Jie</given-names>
</name>
<address>
<email>linjie891@163.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wei</surname>
<given-names>Jing</given-names>
</name>
<address>
<email>JingWeixy2012@163.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Adjeroh</surname>
<given-names>Donald</given-names>
</name>
<address>
<email>Donald.Adjeroh@mail.wvu.edu</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Jiang</surname>
<given-names>Bing-Hua</given-names>
</name>
<address>
<email>binghjiang@yahoo.com</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Jiang</surname>
<given-names>Yue</given-names>
</name>
<address>
<email>yueljiang@163.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9271 2478</institution-id>
<institution-id institution-id-type="GRID">grid.411503.2</institution-id>
<institution>College of Mathematics and Informatics, Fujian Normal University,</institution>
</institution-wrap>
Fuzhou, 350108 People’s Republic of China</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2156 6140</institution-id>
<institution-id institution-id-type="GRID">grid.268154.c</institution-id>
<institution>Lane Department of Computer Science and Electrical Engineering, West Virginia University,</institution>
</institution-wrap>
Morgantown, 26506 WV USA</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1936 8294</institution-id>
<institution-id institution-id-type="GRID">grid.214572.7</institution-id>
<institution>Department of Pathology, University of Iowa,</institution>
</institution-wrap>
Iowa City, 52242 Iowa USA</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>2</day>
<month>5</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>2</day>
<month>5</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>19</volume>
<elocation-id>165</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>11</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>4</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts.</p>
</sec>
<sec>
<title>Results</title>
<p>A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts
<italic>k</italic>
-mers from a sequence, then maps each
<italic>k</italic>
-mer to a complex number field. The resulting series of complex numbers is then transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification.</p>
</sec>
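<sec>
<title>Illustrative sketch</title>
<p>The pipeline described in the Results can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base-to-complex mapping, the averaging of bases within a k-mer, the choice of a Haar filter with periodic extension, and the mean-magnitude summary are all hypothetical, as are the function names.</p>
<preformat>
```python
import math
from typing import List

# ASSUMPTION: this base-to-complex mapping is hypothetical, chosen only for
# illustration; the abstract does not state the mapping SSAW uses.
BASE_TO_COMPLEX = {"A": 1 + 0j, "C": 0 + 1j, "G": -1 + 0j, "T": 0 - 1j}


def extract_kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_to_complex(kmer: str) -> complex:
    """Map a k-mer to one complex number (hypothetical: mean of its bases)."""
    return sum(BASE_TO_COMPLEX[b] for b in kmer) / len(kmer)


def haar_swt(signal: List[complex], levels: int):
    """Undecimated (stationary) Haar transform with periodic extension.

    Returns (approx, details): the final approximation and one detail
    series per level, each as long as the input signal.
    """
    n = len(signal)
    approx = list(signal)
    details = []
    for level in range(levels):
        step = 2 ** level  # filter taps move apart instead of downsampling
        shifted = [approx[(i + step) % n] for i in range(n)]
        details.append([(a - b) / math.sqrt(2) for a, b in zip(approx, shifted)])
        approx = [(a + b) / math.sqrt(2) for a, b in zip(approx, shifted)]
    return approx, details


def feature_vector(seq: str, k: int = 3, levels: int = 2) -> List[float]:
    """Summarise a sequence as mean magnitudes of its wavelet coefficients."""
    series = [kmer_to_complex(km) for km in extract_kmers(seq, k)]
    approx, details = haar_swt(series, levels)
    bands = details + [approx]
    return [sum(abs(c) for c in band) / len(band) for band in bands]
```
</preformat>
<p>Under these assumptions, feature_vector("ACGTACGTAC") yields one non-negative value per detail level plus one for the final approximation, and two sequences could then be compared with any vector distance.</p>
</sec>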
<sec>
<title>Conclusions</title>
<p>Using two different types of applications, namely clustering and classification, we compared SSAW against state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These results make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>
<italic>k</italic>
-mers</kwd>
<kwd>Wavelet transform</kwd>
<kwd>Complex numbers</kwd>
<kwd>Sequence similarity</kwd>
<kwd>Frequency domain</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution>Chinese National Natural Science Foundation</institution>
</funding-source>
<award-id>61472082</award-id>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>Natural Science Foundation of Fujian Province of China</institution>
</funding-source>
<award-id>2014J01220</award-id>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>Scientific Research Innovation Team Construction Program of Fujian Normal University </institution>
</funding-source>
<award-id>IRTL1702</award-id>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>Efficient and accurate similarity analysis for a large number of sequences is a challenging problem in computational biology [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
]. Alignment-based and alignment-free sequence similarity analysis are the two primary approaches to this problem. However, the huge computational time requirement of the traditional alignment-based methods is a major bottleneck [
<xref ref-type="bibr" rid="CR3">3</xref>
]. Alignment-free methods have continued to grow in popularity, given their high time efficiency and competitive performance with respect to accuracy [
<xref ref-type="bibr" rid="CR3">3</xref>
–
<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
<p>Over the years, alignment-free methods have been used on various sequence analysis problems in biology and medicine, including DNA sequences [
<xref ref-type="bibr" rid="CR6">6</xref>
–
<xref ref-type="bibr" rid="CR8">8</xref>
], RNA sequences [
<xref ref-type="bibr" rid="CR9">9</xref>
], protein sequences [
<xref ref-type="bibr" rid="CR10">10</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
], as well as in detection of single nucleotide variants in genomes [
<xref ref-type="bibr" rid="CR12">12</xref>
], cancer mutations [
<xref ref-type="bibr" rid="CR13">13</xref>
], analysis of genetic gene transfer [
<xref ref-type="bibr" rid="CR14">14</xref>
,
<xref ref-type="bibr" rid="CR15">15</xref>
], and even in clinical practice [
<xref ref-type="bibr" rid="CR16">16</xref>
]. Although initially developed for problems in computational biology [
<xref ref-type="bibr" rid="CR17">17</xref>
–
<xref ref-type="bibr" rid="CR22">22</xref>
], alignment-free methods have found significant applications in many other areas, e.g., computer science [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
], graphics [
<xref ref-type="bibr" rid="CR23">23</xref>
], and forensic science [
<xref ref-type="bibr" rid="CR24">24</xref>
].</p>
<p>Alignment-free approaches are broadly divided into two groups [
<xref ref-type="bibr" rid="CR3">3</xref>
]: word-based methods and information-theory-based methods. Word-based methods commonly divide sequences into words (also called
<italic>k</italic>
-mers,
<italic>k</italic>
-tuples, or
<italic>k</italic>
-strings) in order to compare their similarity (or dissimilarity) [
<xref ref-type="bibr" rid="CR25">25</xref>
]. Information-theory-based methods usually evaluate the informational content of full sequences [
<xref ref-type="bibr" rid="CR26">26</xref>
–
<xref ref-type="bibr" rid="CR29">29</xref>
]. According to Bonham-Carter et al. [
<xref ref-type="bibr" rid="CR25">25</xref>
], the word-based methods can be further divided into five categories, namely, base-base correlations (BBC), feature frequency profiles (FFPs), compositional vectors (CVs), string composition methods, and the
<italic>D</italic>
<sub>2</sub>
-statistic family.</p>
<p>Our proposed SSAW method is most closely related to the feature frequency profiles among the word-based methods [
<xref ref-type="bibr" rid="CR25">25</xref>
]. Bonham-Carter et al. [
<xref ref-type="bibr" rid="CR25">25</xref>
] surveyed 14 different alignment-free word-based methods [
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
–
<xref ref-type="bibr" rid="CR37">37</xref>
]. Many new approaches continue to emerge [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR38">38</xref>
–
<xref ref-type="bibr" rid="CR41">41</xref>
]. Among them, the Wavelet-based Feature Vector (WFV) model by Bao et al. [
<xref ref-type="bibr" rid="CR41">41</xref>
] transformed DNA sequences into a numeric feature vector for further classification. Our work is inspired by this transformation.</p>
<p>The Fourier transform has been used to convert DNA sequences into feature vectors and was reported to be efficient [
<xref ref-type="bibr" rid="CR42">42</xref>
–
<xref ref-type="bibr" rid="CR45">45</xref>
]. Although the Fourier transform clearly characterizes a sequence in the frequency domain, it does not localize features in the time domain. The wavelet transform has been used to overcome this shortcoming [
<xref ref-type="bibr" rid="CR46">46</xref>
,
<xref ref-type="bibr" rid="CR47">47</xref>
]. Haimovich et al. [
<xref ref-type="bibr" rid="CR48">48</xref>
] studied DNA sequences of different functions, and found that the wavelet transform of the DNA walk constructed from the varied genome sequences (from short to long nucleotide sequences) provides an effective representation for sequence analysis. Nanni et al. [
<xref ref-type="bibr" rid="CR49">49</xref>
] used wavelet trees to combine different features to improve classification performance.</p>
<p>The discrete and stationary wavelet transforms are popular approaches in signal analysis using wavelets [
<xref ref-type="bibr" rid="CR50">50</xref>
]. Bao et al. [
<xref ref-type="bibr" rid="CR41">41</xref>
] proposed the Wavelet-based Feature Vector (WFV) model, in which DNA sequences are converted into numeric sequences according to the rules
<italic>A</italic>
=0,
<italic>C</italic>
=1,
<italic>G</italic>
=2, and
<italic>T</italic>
=3. The local frequency entropy of the sequence based on the location distribution and word frequency of the base is calculated. A feature vector with fixed length representing a DNA sequence is extracted by using the Discrete Wavelet Transformation (DWT). The stationary wavelet transformation is reported to be lossless [
<xref ref-type="bibr" rid="CR51">51</xref>
] and provides a better performance in image transformation than the discrete counterpart [
<xref ref-type="bibr" rid="CR52">52</xref>
,
<xref ref-type="bibr" rid="CR53">53</xref>
]. The major reason is that the Discrete Wavelet Transform (DWT) has a downsampling step, which discards information in the process. Because the stationary discrete wavelet transform has no downsampling step, the approximation coefficients have the same length as the input signal after decomposition. Hence, the stationary wavelet transformation is used in this study.</p>
<p>Thus, the proposed SSAW (Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform) model is based on the stationary wavelet transformation. The
<italic>k</italic>
-mers of different lengths are extracted from the sequences and transformed into a feature vector with complex numbers by mapping to a unit circle. This process reduces the dimensionality of the data and also improves the computation speed. The experimental results show the effectiveness of the SSAW approach, demonstrating improved accuracy and faster running time when compared with WFV and other recent approaches. Below, we provide a brief description of the stationary discrete wavelet transform.</p>
<sec id="Sec2">
<title>Stationary discrete wavelet transform</title>
<p>Given a function
<italic>x</italic>
(
<italic>t</italic>
), its continuous wavelet transformation,
<italic>CWT</italic>
(
<italic>x</italic>
) is obtained by applying a mother wavelet function
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\psi ^{*}\big (\frac {t-b}{a}\big)$\end{document}</tex-math>
<mml:math id="M2">
<mml:msup>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mstyle mathsize="1.19em">
<mml:mfenced close="" open="(" separators="">
<mml:mrow></mml:mrow>
</mml:mfenced>
</mml:mstyle>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle mathsize="1.19em">
<mml:mfenced close="" open=")" separators="">
<mml:mrow></mml:mrow>
</mml:mfenced>
</mml:mstyle>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
, as shown in Eq.
<xref rid="Equ1" ref-type="">1</xref>
:
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ CWT_{x}(a,b)=\frac{1}{|\sqrt{a}|}\int^{\infty}_{-\infty}x(t)\psi^{*}\left(\frac{t-b}{a}\right)dt $$ \end{document}</tex-math>
<mml:math id="M4">
<mml:mtext mathvariant="italic">CW</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msqrt>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>∞</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>x</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mtext mathvariant="italic">dt</mml:mtext>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where,
<italic>CWT</italic>
<sub>
<italic>x</italic>
</sub>
(
<italic>a</italic>
,
<italic>b</italic>
) is the wavelet transform for the signal x(t),
<italic>a</italic>
is the scale parameter,
<italic>b</italic>
is the translation distance, and
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\psi ^{*} \big (\frac {t-b}{a}\big) $\end{document}</tex-math>
<mml:math id="M6">
<mml:msup>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mstyle mathsize="1.19em">
<mml:mfenced close="" open="(" separators="">
<mml:mrow></mml:mrow>
</mml:mfenced>
</mml:mstyle>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle mathsize="1.19em">
<mml:mfenced close="" open=")" separators="">
<mml:mrow></mml:mrow>
</mml:mfenced>
</mml:mstyle>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the mother wavelet function.</p>
<p>A common practice is to discretize the scale and translation parameters using a power series. Variables
<italic>a</italic>
and
<italic>b</italic>
can be respectively discretized as follows:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$a=a_{0}^{j}, b=nb_{0}a_{0}^{j}; \text{where}\ j,n\in Z, a_{0},b_{0}\in Z, \text{and}\ a_{0}\neq 1. $$ \end{document}</tex-math>
<mml:math id="M8">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>n</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>;</mml:mo>
<mml:mtext>where</mml:mtext>
<mml:mspace width="1em"></mml:mspace>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext>and</mml:mtext>
<mml:mspace width="1em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>≠</mml:mo>
<mml:mn>1</mml:mn>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>In general,
<italic>a</italic>
<sub>0</sub>
=2, and
<italic>b</italic>
<sub>0</sub>
=1. Then the scaled and translated wavelet can be expressed as:
<disp-formula id="Equb">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\psi_{j,n}(t)=2^{\frac {-j}{2}} \psi\left(2^{-j}t-n\right) $$ \end{document}</tex-math>
<mml:math id="M10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:msup>
<mml:mi>ψ</mml:mi>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Thus, the corresponding discrete wavelet transform is given by:
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ DWT_{x}(j,n)=2^{-\frac{j}{2}}\int^{\infty}_{-\infty}x(t)\psi_{j,n}^{*}\left(\frac{t}{2^{j}}-n\right)dt $$ \end{document}</tex-math>
<mml:math id="M12">
<mml:mtext mathvariant="italic">DW</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:msup>
<mml:munderover>
<mml:mrow>
<mml:mo>∫</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>∞</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>x</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>−</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mtext mathvariant="italic">dt</mml:mtext>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where,
<italic>j</italic>
is the scale parameter, and
<italic>n</italic>
is the translation distance.</p>
<p>The wavelet transform can characterize the local features of a signal in both the time domain and the frequency domain. It is a time-frequency localized analysis method that can vary the time window and the frequency window through multi-resolution analysis. The wavelet transform obtains the time information of the signal by translating the mother wavelet. The frequency characteristics of the signal are obtained by scaling the width of the mother wavelet.</p>
<p>With the discrete wavelet transform (DWT), the signal is downsampled each time it is decomposed. The downsampled signal keeps either the even-indexed or the odd-indexed samples, but not both, so each decomposition step discards half of the data. Therefore, as the number of DWT decomposition steps increases, the extracted signals lose significant time-shifted information in the original sequence. The stationary wavelet transform (SWT) does not apply the downsampling process and thus better preserves the information in the original sequence. The SWT decomposition yields the approximation coefficients and the detail coefficients. The approximation coefficients preserve most of the information and reflect the transformation characteristics of the signal. The detail coefficients mainly preserve the local and noise characteristics of the signal, and can be discarded. In this work, only the approximation coefficients are used to represent the input sequence.</p>
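<p>The effect of downsampling on the output length can be illustrated with a minimal Python sketch of one Haar decomposition level. This is an illustration only, not the implementation used in this work: the function names and the periodic boundary extension are assumptions made for the example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def haar_dwt_approx(signal):
    """One level of the decimated Haar DWT (approximation coefficients only).

    Adjacent samples are paired, so the output has half the input length.
    """
    scale = 1 / math.sqrt(2)
    return [scale * (signal[i] + signal[i + 1])
            for i in range(0, len(signal) - 1, 2)]


def haar_swt_approx(signal):
    """One level of the undecimated (stationary) Haar transform.

    Assumption: periodic extension at the boundary. Every shift is kept,
    so the output has the same length as the input.
    """
    n = len(signal)
    scale = 1 / math.sqrt(2)
    return [scale * (signal[i] + signal[(i + 1) % n]) for i in range(n)]


x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
dwt = haar_dwt_approx(x)  # 4 coefficients
swt = haar_swt_approx(x)  # 8 coefficients
```
]]></preformat>
<p>In this sketch the even-indexed stationary coefficients coincide with the decimated ones, which shows that the DWT output is a subsample of the SWT output.</p>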
<p>The proposed SSAW model uses a simple Haar mother wavelet to construct the feature vector. The Haar wavelet is a compactly supported orthogonal wavelet with a short support length. The Haar wavelet function
<italic>ψ</italic>
<sub>
<italic>H</italic>
</sub>
is defined as follows:
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \psi_{H} (x)= \left\{ \begin{array}{rl} 1 & 0 \le x \le \frac{1}{2}\\ -1 & \frac{1}{2} < x \le 1\\ 0 & otherwise\\ \end{array} \right\} $$ \end{document}</tex-math>
<mml:math id="M14">
<mml:msub>
<mml:mrow>
<mml:mi>ψ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced close="}" open="{" separators="">
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>&lt;</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mtext mathvariant="italic">otherwise</mml:mtext>
</mml:mtd>
</mml:mtr>
<mml:mtr></mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>Different mother wavelets have different time-frequency characteristics. In the time-frequency analysis window, the smaller the width of the time domain window, the better the performance of the mother wavelet in time domain analysis. Similarly, the smaller the width of the frequency domain window, the better the performance of the mother wavelet in frequency domain analysis.</p>
</sec>
</sec>
<sec id="Sec3">
<title>Methods</title>
<sec id="Sec4">
<title>Detailed steps</title>
<p>There are four steps in our proposed SSAW method. First,
<italic>k</italic>
-mers are extracted from a sequence and their corresponding frequencies are counted and standardized/normalized. Second, each
<italic>k</italic>
-mer is transformed into a complex number by mapping the
<italic>k</italic>
-mers to a unit circle. Third, the stationary wavelet transformation is performed on the resulting sequence of complex numbers. Finally, clustering and/or classification is applied as needed, depending on the specific application of interest.</p>
<sec id="Sec5">
<title>Step 1:
<italic>k</italic>
-mer extraction and frequency standardization</title>
<p>Given a genetic sequence
<italic>S</italic>
of length
<italic>M</italic>
,
<italic>k</italic>
-mers are extracted from the sequence by passing a sliding window of length
<italic>k</italic>
(varied from 2 to
<italic>M</italic>
−1) over the sequence. There are
<italic>M</italic>
−
<italic>k</italic>
+1 total
<italic>k</italic>
-mers in a sequence with length
<italic>M</italic>
. There are at most |
<italic>Σ</italic>
|
<sup>
<italic>k</italic>
</sup>
distinct
<italic>k</italic>
-mers for a sequence over an alphabet of size |
<italic>Σ</italic>
|. For a fixed
<italic>k</italic>
, a unit circle is divided evenly into |
<italic>Σ</italic>
|
<sup>
<italic>k</italic>
</sup>
parts. A DNA sequence consists of symbols from the alphabet
<italic>Σ</italic>
={
<italic>A</italic>
,
<italic>C</italic>
,
<italic>G</italic>
,
<italic>T</italic>
}, so |
<italic>Σ</italic>
|=4. A protein sequence consists of symbols from a larger alphabet,
<italic>Σ</italic>
= {
<italic>A</italic>
,
<italic>C</italic>
,
<italic>D</italic>
,
<italic>E</italic>
,
<italic>F</italic>
,
<italic>G</italic>
,
<italic>H</italic>
,
<italic>I</italic>
,
<italic>K</italic>
,
<italic>L</italic>
,
<italic>M</italic>
,
<italic>N</italic>
,
<italic>P</italic>
,
<italic>Q</italic>
,
<italic>R</italic>
,
<italic>S</italic>
,
<italic>T</italic>
,
<italic>V</italic>
,
<italic>W</italic>
,
<italic>Y</italic>
}, with |
<italic>Σ</italic>
|=20.</p>
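<p>The sliding-window extraction described above can be sketched in a few lines of Python. The function name <italic>count_kmers</italic> and the example sequence are hypothetical and serve only to illustrate the counting step.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every k-mer seen by a sliding window of length k.

    A sequence of length M yields M - k + 1 windows in total.
    """
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical DNA fragment of length M = 8, so there are 8 - 2 + 1 = 7 2-mers.
counts = count_kmers("ACGTACGA", 2)
```
]]></preformat>
<p>For a DNA alphabet, the resulting table can hold at most 4 to the power <italic>k</italic> distinct keys, here 16.</p>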
<p>Let
<italic>X</italic>
<sub>
<italic>t</italic>
</sub>
denote the frequency of the
<italic>t</italic>
-th
<italic>k</italic>
-mer in a sequence and let
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
represent the standardization of
<italic>X</italic>
<sub>
<italic>t</italic>
</sub>
by using
<italic>z</italic>
-score normalization, as shown in Eq.
<xref rid="Equ4" ref-type="">4</xref>
.
<disp-formula id="Equ4">
<label>4</label>
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ S_{t}= \frac{X_{t}-\overline{X}}{sd} $$ \end{document}</tex-math>
<mml:math id="M16">
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">sd</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ4.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\overline {X}$\end{document}</tex-math>
<mml:math id="M18">
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
represents the mean frequency of a
<italic>k</italic>
-mer
<italic>X</italic>
occurring in all the sequences. The denominator
<italic>sd</italic>
denotes the standard deviation of the frequencies of the
<italic>k</italic>
-mer
<italic>X</italic>
in all the sequences.</p>
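<p>As a minimal sketch of the standardization in Eq. 4, the following Python code standardizes each <italic>k</italic>-mer column across all sequences. The use of the sample standard deviation and the convention of returning zero for a column with zero deviation are assumptions of this example, since the text does not specify them; the function name and data are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import statistics


def standardize_columns(freq_table):
    """Z-score each k-mer column; freq_table[i][t] is the count of the
    t-th k-mer in sequence i."""
    n_kmers = len(freq_table[0])
    standardized = [[0.0] * n_kmers for _ in freq_table]
    for t in range(n_kmers):
        column = [row[t] for row in freq_table]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)  # assumption: sample standard deviation
        for i, x in enumerate(column):
            # Assumption: a k-mer with constant frequency is mapped to 0.
            standardized[i][t] = (x - mean) / sd if sd > 0 else 0.0
    return standardized


# Three hypothetical sequences, two k-mers each.
freq_table = [[2, 1], [4, 1], [6, 1]]
z = standardize_columns(freq_table)
```
]]></preformat>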
<p>Motivated by the work in [
<xref ref-type="bibr" rid="CR18">18</xref>
,
<xref ref-type="bibr" rid="CR54">54</xref>
], we use the following recommended length for
<italic>k</italic>
, given by:
<disp-formula id="Equ5">
<label>5</label>
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ k= \left\lceil \log_{|\Sigma|}\left(\sqrt{|S|}\right)\right\rceil=\left\lceil \frac{\log_{|\Sigma|}(|S|)}{2} \right\rceil $$ \end{document}</tex-math>
<mml:math id="M20">
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfenced close="⌉" open="⌈" separators="">
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced close="⌉" open="⌈" separators="">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mo>(</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ5.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where |
<italic>S</italic>
| is the average sequence length.</p>
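<p>Eq. 5 can be evaluated without floating-point logarithms, because the ceiling of half the logarithm is the smallest integer <italic>k</italic> for which the alphabet size raised to 2<italic>k</italic> is at least the average length. The sketch below uses this equivalent form; the function name is hypothetical, and an average length greater than 1 is assumed.</p>
<preformat preformat-type="code"><![CDATA[
```python
def recommended_k(avg_length, alphabet_size):
    """Smallest k with alphabet_size ** (2 * k) >= avg_length.

    Equivalent to ceil(log_base_alphabet(avg_length) / 2) for avg_length > 1.
    """
    k = 1
    while alphabet_size ** (2 * k) < avg_length:
        k += 1
    return k


k_dna = recommended_k(1000, 4)      # 4**4 = 256 < 1000 <= 4**6 = 4096
k_protein = recommended_k(300, 20)  # 300 <= 20**2 = 400
```
]]></preformat>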
</sec>
<sec id="Sec6">
<title>Step 2: Transform
<italic>k</italic>
-mers to complex numbers</title>
<p>For a sequence with symbols from an alphabet
<italic>Σ</italic>
, there are at most |
<italic>Σ</italic>
|
<sup>
<italic>k</italic>
</sup>
unique
<italic>k</italic>
-mers. First, sort all
<italic>k</italic>
-mers alphabetically. Given a unit circle, we evenly distribute all the |
<italic>Σ</italic>
|
<sup>
<italic>k</italic>
</sup>
<italic>k</italic>
-mers around the circumference of the unit circle, moving counterclockwise. A
<italic>k</italic>
-mer is transformed into a complex number as follows:
<list list-type="bullet">
<list-item>
<p>The sine of the angle the
<italic>k</italic>
-mer resides in becomes the real part of a complex number;</p>
</list-item>
<list-item>
<p>the cosine of the angle the
<italic>k</italic>
-mer resides in becomes the imaginary part of a complex number.</p>
</list-item>
</list>
</p>
<p>The angle of the
<italic>t</italic>
-th
<italic>k</italic>
-mer
<italic>φ</italic>
<sub>
<italic>t</italic>
</sub>
is given by:
<disp-formula id="Equ6">
<label>6</label>
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ \varphi_{t}=\frac{360}{|\Sigma|^{k}}\times t $$ \end{document}</tex-math>
<mml:math id="M22">
<mml:msub>
<mml:mrow>
<mml:mi>φ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>360</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mi>t</mml:mi>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ6.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>t</italic>
denotes the position of the
<italic>t</italic>
-th
<italic>k</italic>
-mer in
<italic>Σ</italic>
<sup>
<italic>k</italic>
</sup>
.</p>
<p>Thus, the complex number representation for the
<italic>t</italic>
-th
<italic>k</italic>
-mer is given by ⟨
<italic>Real</italic>
<sub>
<italic>t</italic>
</sub>
,
<italic>Imag</italic>
<sub>
<italic>t</italic>
</sub>
⟩ = ⟨sin(
<italic>φ</italic>
<sub>
<italic>t</italic>
</sub>
), cos(
<italic>φ</italic>
<sub>
<italic>t</italic>
</sub>
)⟩, where
<italic>Real</italic>
<sub>
<italic>t</italic>
</sub>
= sin(
<italic>φ</italic>
<sub>
<italic>t</italic>
</sub>
) is the real part, and
<italic>Imag</italic>
<sub>
<italic>t</italic>
</sub>
= cos(
<italic>φ</italic>
<sub>
<italic>t</italic>
</sub>
) is the imaginary part.</p>
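<p>The mapping of Step 2 can be sketched as follows. The code enumerates all <italic>k</italic>-mers in alphabetical order, computes the angle of Eq. 6 in degrees, and stores the sine as the real part and the cosine as the imaginary part, as stated above. Counting positions from <italic>t</italic>=1 is an assumption of this example, and the function name is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import itertools
import math


def kmer_codes(alphabet, k):
    """Map each k-mer over `alphabet` to a complex number on the unit circle."""
    # product() over a sorted alphabet yields the k-mers in alphabetical order.
    kmers = ["".join(p) for p in itertools.product(sorted(alphabet), repeat=k)]
    step_degrees = 360.0 / len(kmers)  # 360 / |alphabet| ** k
    codes = {}
    for t, kmer in enumerate(kmers, start=1):  # assumption: t starts at 1
        phi = math.radians(step_degrees * t)
        # Real part = sin(phi), imaginary part = cos(phi), following the text.
        codes[kmer] = complex(math.sin(phi), math.cos(phi))
    return codes


codes = kmer_codes("ACGT", 1)  # angles of 90, 180, 270 and 360 degrees
```
]]></preformat>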
</sec>
<sec id="Sec7">
<title>Step 3: Stationary wavelet transformation</title>
<p>After a sequence is transformed into a series of complex numbers, the real and imaginary parts of the complex numbers are multiplied by the corresponding standardized frequency (
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
) of
<italic>k</italic>
-mers from the first step. The stationary wavelet transformation is then performed. Given an original string
<italic>S</italic>
, let
<italic>CODE</italic>
<sub>
<italic>S</italic>
</sub>
denote the series of complex numbers which are the combination of the real part and the imaginary part based on the sequence of
<italic>k</italic>
-mers. We apply the Haar transformation on
<italic>CODE</italic>
<sub>
<italic>S</italic>
</sub>
as shown in Eq.
<xref rid="Equ7" ref-type="">7</xref>
.
<disp-formula id="Equ7">
<label>7</label>
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ F(S)=HaarSDWT_{AC}\left(CODE_{S},L\right) $$ \end{document}</tex-math>
<mml:math id="M24">
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext mathvariant="italic">HaarSDW</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">AC</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mtext mathvariant="italic">COD</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ7.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where,
<italic>F</italic>
(
<italic>S</italic>
) denotes the feature vector representing sequence
<italic>S</italic>
, and
<italic>L</italic>
is the decomposition level. The function
<italic>H</italic>
<italic>a</italic>
<italic>a</italic>
<italic>r</italic>
<italic>S</italic>
<italic>D</italic>
<italic>W</italic>
<italic>T</italic>
<sub>
<italic>AC</italic>
</sub>
() denotes the SDWT using the Haar mother wavelet, while retaining the AC coefficients. We use the package SWT2 [
<xref ref-type="bibr" rid="CR55">55</xref>
] in MATLAB for this transformation. A feature vector
<italic>F</italic>
(
<italic>S</italic>
) is obtained after the transformation.</p>
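<p>As an illustration only, the following Python sketch shows one way the approximation band of an undecimated (stationary) Haar transform could be computed. It is not the MATLAB routine used in this work. The function names haar_swt_approx and feature_vector, the periodic boundary handling, the 1-D treatment of the real and imaginary series, and the reading of "AC" as approximation coefficients are all assumptions made for this sketch.</p>
<preformat>
```python
import math


def haar_swt_approx(signal, level):
    """Approximation band of an undecimated Haar transform (periodic boundary).

    Assumption: at stage j the two Haar taps are 2**j samples apart, so the
    output keeps the same length as the input at every level.
    """
    n = len(signal)
    approx = [float(v) for v in signal]
    for j in range(level):
        step = 2 ** j
        approx = [
            (approx[i] + approx[(i + step) % n]) / math.sqrt(2.0)
            for i in range(n)
        ]
    return approx


def feature_vector(code, level):
    """Hypothetical F(S): transform real and imaginary parts, then concatenate."""
    real = haar_swt_approx([re for re, _ in code], level)
    imag = haar_swt_approx([im for _, im in code], level)
    return real + imag
```
</preformat>
<p>Because no downsampling takes place, each transformed series has the same length as its input, which is what makes the resulting vectors directly comparable across sequences in this sketch.</p>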
</sec>
<sec id="Sec8">
<title>Step 4: Clustering/classification using the feature vectors.</title>
<p>After the above processing, a text sequence is transformed into a feature vector. These feature vectors can then be used in clustering and classification applications. For proof of concept, we applied a simple clustering technique (namely, the
<italic>k</italic>
-means clustering algorithm) on the feature vectors. Similarly, for classification, we applied simple classification approaches (namely,
<italic>k</italic>
-Nearest Neighbor approach, using just
<italic>k</italic>
=1, referred to below as
<italic>1-NN</italic>
). Finally, the experimental results are evaluated.</p>
</sec>
<sec id="Sec9">
<title>A simple example</title>
<p>Here, we discuss a simple example. Consider two DNA sequences,
<italic>S1:AACAA</italic>
and
<italic>S2:CCGCC</italic>
. Assume that the sliding window length
<italic>K</italic>
is 2. There are |
<italic>Σ</italic>
|
<sup>
<italic>K</italic>
</sup>
= 4
<sup>2</sup>
=16 unique
<italic>k</italic>
-mers. The unit circle will be divided into 16 parts in this case.</p>
<p>As shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
, all 16
<italic>k</italic>
-mers are listed in the first row. The frequency of a
<italic>k</italic>
-mer (
<italic>X</italic>
<sub>
<italic>t</italic>
</sub>
) is counted separately for each sequence. Many
<italic>k</italic>
-mers have a zero frequency in this simple example. However, in real applications, this is seldom the case, since the sequences are generally much longer. Similarly, the standard deviation
<italic>sd</italic>
in the denominator of Eq.
<xref rid="Equ4" ref-type="">4</xref>
is rarely zero. For the purpose of this demonstration only, we assume a series of non-zero values for
<italic>sd</italic>
, which are shown in the last row of the table. A similar assumption is applied to
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\overline {X}$\end{document}</tex-math>
<mml:math id="M26">
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
, which is listed in the second-to-last row.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Length 2
<italic>k</italic>
-mers and associated standardized frequencies (Eq.
<xref rid="Equ4" ref-type="">4</xref>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">k-mers</th>
<th align="left">AA</th>
<th align="left">AC</th>
<th align="left">AG</th>
<th align="left">AT</th>
<th align="left">CA</th>
<th align="left">CC</th>
<th align="left">CG</th>
<th align="left">CT</th>
<th align="left">GA</th>
<th align="left">GC</th>
<th align="left">GG</th>
<th align="left">GT</th>
<th align="left">TA</th>
<th align="left">TC</th>
<th align="left">TG</th>
<th align="left">TT</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">S1</td>
<td align="left">
<italic>X</italic>
<sub>
<italic>t</italic>
</sub>
</td>
<td align="left">2</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
</td>
<td align="left">0.07</td>
<td align="left">-0.84</td>
<td align="left">-0.17</td>
<td align="left">-0.38</td>
<td align="left">-0.76</td>
<td align="left">-0.76</td>
<td align="left">-0.55</td>
<td align="left">-0.38</td>
<td align="left">-0.09</td>
<td align="left">-0.76</td>
<td align="left">-0.42</td>
<td align="left">-0.14</td>
<td align="left">-0.09</td>
<td align="left">-0.35</td>
<td align="left">-0.18</td>
<td align="left">-0.3</td>
</tr>
<tr>
<td align="left">S2</td>
<td align="left">
<italic>X</italic>
<sub>
<italic>t</italic>
</sub>
</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">2</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
</td>
<td align="left">-0.41</td>
<td align="left">-1.13</td>
<td align="left">-0.17</td>
<td align="left">-0.38</td>
<td align="left">-1.02</td>
<td align="left">-0.23</td>
<td align="left">-0.29</td>
<td align="left">-0.38</td>
<td align="left">-0.09</td>
<td align="left">-0.48</td>
<td align="left">-0.42</td>
<td align="left">-0.14</td>
<td align="left">-0.09</td>
<td align="left">-0.35</td>
<td align="left">-0.18</td>
<td align="left">-0.3</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\overline {X}$\end{document}</tex-math>
<mml:math id="M28">
<mml:mover accent="false">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">1.7</td>
<td align="left">3.9</td>
<td align="left">0.9</td>
<td align="left">1.3</td>
<td align="left">3.9</td>
<td align="left">2.9</td>
<td align="left">2.1</td>
<td align="left">1.3</td>
<td align="left">0.3</td>
<td align="left">2.7</td>
<td align="left">1.5</td>
<td align="left">0.7</td>
<td align="left">0.3</td>
<td align="left">1.2</td>
<td align="left">0.7</td>
<td align="left">1.1</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">sd</td>
<td align="left">4.14</td>
<td align="left">3.45</td>
<td align="left">5.17</td>
<td align="left">3.45</td>
<td align="left">3.84</td>
<td align="left">3.84</td>
<td align="left">3.84</td>
<td align="left">3.45</td>
<td align="left">3.45</td>
<td align="left">3.55</td>
<td align="left">3.55</td>
<td align="left">5.07</td>
<td align="left">3.45</td>
<td align="left">3.45</td>
<td align="left">3.89</td>
<td align="left">3.71</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Then, Eq.
<xref rid="Equ4" ref-type="">4</xref>
is applied to calculate the corresponding standardized frequency (
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
) of a
<italic>k</italic>
-mer. For example, for the first
<italic>k</italic>
-mer
<italic>AA</italic>
in sequence
<italic>S</italic>
1, the normalized value is
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\frac {2-1.7}{4.14}=0.07$\end{document}</tex-math>
<mml:math id="M30">
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>−</mml:mo>
<mml:mn>1.7</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4.14</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mn>0.07</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
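<p>The counting and standardization steps of this example can be sketched in Python as follows. The function names count_kmers and standardize are hypothetical, and the mean (1.7) and standard deviation (4.14) are the assumed demonstration values from Table 1, not quantities estimated from data.</p>
<preformat>
```python
from itertools import product


def count_kmers(sequence, k, alphabet="ACGT"):
    """Count every k-mer with a sliding window; absent k-mers get 0.

    Assumes the sequence only contains characters from `alphabet`.
    """
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


def standardize(count, mean, sd):
    """Standardized frequency in the form used by the worked example."""
    return (count - mean) / sd


s1_counts = count_kmers("AACAA", 2)
s_aa = standardize(s1_counts["AA"], mean=1.7, sd=4.14)
print(len(s1_counts), s1_counts["AA"], round(s_aa, 2))  # 16 2 0.07
```
</preformat>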
<p>In the second step, the unit circle is divided into 16 equal parts. Since the length of the
<italic>k</italic>
-mer is assumed to be 2 here, there are |
<italic>Σ</italic>
|
<sup>
<italic>K</italic>
</sup>
= 4
<sup>2</sup>
=16 possible unique
<italic>k</italic>
-mers. These 16
<italic>k</italic>
-mers are distributed on the unit circle in a counterclockwise manner, as shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>The distribution of 16
<italic>k</italic>
-mers (AA, AC, …, TT) on the unit circle, moving counterclockwise</p>
</caption>
<graphic xlink:href="12859_2018_2155_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>Each
<italic>k</italic>
-mer has a corresponding angle on the unit circle. For example, for the first
<italic>k</italic>
-mer
<italic>AA</italic>
, the angle (expressed in degrees here) is
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\frac {360}{|\Sigma |^{K}}\times t=\frac {360}{4^{2}}\times 1$\end{document}</tex-math>
<mml:math id="M32">
<mml:mfrac>
<mml:mrow>
<mml:mn>360</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Σ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>360</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
=22.5°. We have
<italic>R</italic>
<italic>e</italic>
<italic>a</italic>
<italic>l</italic>
<sub>
<italic>t</italic>
</sub>
= sin(22.5°)=0.38. The imaginary part of the complex number is
<italic>I</italic>
<italic>m</italic>
<italic>a</italic>
<italic>g</italic>
<sub>
<italic>t</italic>
</sub>
= cos(22.5°)=0.92. Hence, the corresponding
<italic>k</italic>
-mer
<italic>AA</italic>
in sequence
<italic>S</italic>
1 is represented as a complex number (0.38,0.92). Then, the standardized frequency
<italic>S</italic>
<sub>
<italic>t</italic>
</sub>
(0.07) from the first step is multiplied by this complex number (0.38,0.92), resulting in the pair (0.0266,0.0644).</p>
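<p>The mapping of a single k-mer to its scaled pair can be reproduced with the short Python sketch below. The helper name kmer_to_pair and its digits parameter are hypothetical. The angle is interpreted in degrees, and the sine and cosine are rounded to two decimals before scaling, because that is what reproduces the numbers quoted in this example.</p>
<preformat>
```python
import math


def kmer_to_pair(t, n_kmers, s_t, digits=2):
    """Map the t-th k-mer (1-based) to (S_t * sin(phi), S_t * cos(phi)).

    phi = 360 / n_kmers * t is taken in degrees, and sin/cos are rounded to
    `digits` decimals before scaling, mirroring the rounding in the text.
    """
    phi = math.radians(360.0 / n_kmers * t)
    real = round(math.sin(phi), digits)
    imag = round(math.cos(phi), digits)
    return (s_t * real, s_t * imag)


pair = kmer_to_pair(t=1, n_kmers=16, s_t=0.07)
print(tuple(round(v, 4) for v in pair))  # (0.0266, 0.0644)
```
</preformat>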
<p>After processing all the
<italic>k</italic>
-mers, a series of complex numbers starting with (0.0266,0.0644) is passed to the third transformation step. After the third step (stationary wavelet transform), a feature vector is obtained, which can then be used for clustering and/or classification.</p>
</sec>
</sec>
<sec id="Sec10">
<title>Distance measurement</title>
<p>The similarity between feature vectors is measured using the Euclidean distance as follows.
<disp-formula id="Equ8">
<label>8</label>
<alternatives>
<tex-math id="M33">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ Eu_{d}(S_{1},S_{2})= \sqrt{\sum_{i=1}^{Vec}|F_{i}(S_{1})-F_{i}(S_{2})|^{2}} $$ \end{document}</tex-math>
<mml:math id="M34">
<mml:mi>E</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">Vec</mml:mtext>
</mml:mrow>
</mml:munderover>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ8.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>Vec</italic>
is the length of the feature vector,
<italic>F</italic>
(
<italic>S</italic>
<sub>1</sub>
) and
<italic>F</italic>
(
<italic>S</italic>
<sub>2</sub>
) denote feature vectors for sequences
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
respectively.</p>
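<p>A minimal Python sketch of Eq. 8 is given below. The function name euclidean_distance is hypothetical, and the check that both vectors have the same length is an added safeguard rather than part of the definition.</p>
<preformat>
```python
import math


def euclidean_distance(f1, f2):
    """Eq. 8 for two equal-length feature vectors."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```
</preformat>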
</sec>
<sec id="Sec11">
<title>The measurement of clustering assessment</title>
<p>The F-score is used to evaluate the clustering results. Let
<italic>C</italic>
<sub>
<italic>i</italic>
</sub>
represent the number of sequences in the family
<italic>i</italic>
; let
<italic>C</italic>
<sub>
<italic>ij</italic>
</sub>
represent the number of sequences belonging to cluster
<italic>j</italic>
in family
<italic>i</italic>
.
<italic>l</italic>
<italic>b</italic>
(
<italic>j</italic>
) represents the family tag of cluster
<italic>j</italic>
; the goal of clustering is for cluster
<italic>j</italic>
to contain the sequences of family
<italic>l</italic>
<italic>b</italic>
(
<italic>j</italic>
).</p>
<p>The family tag of cluster
<italic>j</italic>
is decided by a dominating rule: the family
<italic>i</italic>
that contributes the largest number of sequences to cluster
<italic>j</italic>
is selected to be
<italic>l</italic>
<italic>b</italic>
(
<italic>j</italic>
), as shown in Eq.
<xref rid="Equ9" ref-type="">9</xref>
:
<disp-formula id="Equ9">
<label>9</label>
<alternatives>
<tex-math id="M35">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ lb(j)=argmax_{i=1}^{fm}\left(C_{ij}\right) $$ \end{document}</tex-math>
<mml:math id="M36">
<mml:mtext mathvariant="italic">lb</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext mathvariant="italic">argma</mml:mtext>
<mml:msubsup>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">fm</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ9.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>fm</italic>
is the number of all possible families.</p>
<p>For a given family
<italic>i</italic>
, the respective values for precision, recall, and F-score are computed as follows:
<disp-formula id="Equ10">
<label>10</label>
<alternatives>
<tex-math id="M37">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ precision_{i}=\frac{\sum\limits_{lb(j)=i}C_{ij}}{\sum\limits_{lb(j)=i} \overline{ C_{j}}} $$ \end{document}</tex-math>
<mml:math id="M38">
<mml:mtext mathvariant="italic">precisio</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">lb</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">lb</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mover accent="false">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ10.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M39">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\overline {C_{j}}$\end{document}</tex-math>
<mml:math id="M40">
<mml:mover accent="false">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo accent="true">¯</mml:mo>
</mml:mover>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
represents the number of sequences in cluster
<italic>j</italic>
.
<disp-formula id="Equ11">
<label>11</label>
<alternatives>
<tex-math id="M41">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ recall_{i}=\frac{\sum\limits_{lb(j)=i}C_{ij}}{C_{i}} $$ \end{document}</tex-math>
<mml:math id="M42">
<mml:mtext mathvariant="italic">recal</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">lb</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">ij</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ11.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="Equ12">
<label>12</label>
<alternatives>
<tex-math id="M43">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ F-score(i)=\frac{2 \times precision(i)\times recall(i) }{precision(i)+recall(i)} $$ \end{document}</tex-math>
<mml:math id="M44">
<mml:mi>F</mml:mi>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="italic">score</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>×</mml:mo>
<mml:mtext mathvariant="italic">precision</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>×</mml:mo>
<mml:mtext mathvariant="italic">recall</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">precision</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mtext mathvariant="italic">recall</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ12.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>The
<italic>F</italic>
-score for all families can be calculated as:
<disp-formula id="Equ13">
<label>13</label>
<alternatives>
<tex-math id="M45">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$ F-score=\sum_{i=1}^{fm}\frac{C_{i}}{C}F(i) $$ \end{document}</tex-math>
<mml:math id="M46">
<mml:mi>F</mml:mi>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="italic">score</mml:mtext>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">fm</mml:mtext>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mi>F</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<graphic xlink:href="12859_2018_2155_Article_Equ13.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where
<italic>C</italic>
is the total number of sequences in the dataset.</p>
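<p>Equations 9 to 13 can be combined into a single routine, sketched below in Python. The function name clustering_f_score and its two parallel input lists are hypothetical. Two choices are assumptions of this sketch rather than part of the definitions: ties in Eq. 9 are broken by taking the first family in sorted order, and a family that tags no cluster receives a precision and F-score of zero.</p>
<preformat>
```python
from collections import Counter


def clustering_f_score(families, clusters):
    """Weighted F-score of Eqs. 9-13 from parallel label lists."""
    c_ij = Counter(zip(families, clusters))  # sequences of family i in cluster j
    c_i = Counter(families)                  # size of family i
    c_j = Counter(clusters)                  # size of cluster j
    # Eq. 9: tag each cluster with its dominating family.
    lb = {
        j: max(sorted(c_i), key=lambda i: c_ij[(i, j)])
        for j in c_j
    }
    total = len(families)
    score = 0.0
    for i in c_i:
        tagged = [j for j in c_j if lb[j] == i]
        hits = sum(c_ij[(i, j)] for j in tagged)
        covered = sum(c_j[j] for j in tagged)
        precision = hits / covered if covered else 0.0           # Eq. 10
        recall = hits / c_i[i]                                   # Eq. 11
        denom = precision + recall
        f_i = 2 * precision * recall / denom if denom else 0.0   # Eq. 12
        score += c_i[i] / total * f_i                            # Eq. 13
    return score


print(clustering_f_score(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```
</preformat>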
</sec>
<sec id="Sec12">
<title>The measurement of classification</title>
<p>We use the confusion matrix (see Table 
<xref rid="Tab2" ref-type="table">2</xref>
) to evaluate the classification performance. The confusion matrix is an
<italic>N</italic>
×
<italic>N</italic>
matrix, where
<italic>N</italic>
is the number of categories in the classification. We use the predicted and original categories to establish the confusion matrix.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Confusion matrix</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left">Predicted class</th>
<th align="left"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">Positive</td>
<td align="left">Negative</td>
</tr>
<tr>
<td align="left">Actual</td>
<td align="left">Positive</td>
<td align="left">True positives (TP)</td>
<td align="left">False negatives (FN)</td>
</tr>
<tr>
<td align="left">class</td>
<td align="left">Negative</td>
<td align="left">False positives (FP)</td>
<td align="left">True negatives (TN)</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Based on the above confusion matrix, the performance indicators are defined as follows.
<list list-type="bullet">
<list-item>
<p>Accuracy = (TP+TN)/(TP+TN+FN+FP)</p>
</list-item>
<list-item>
<p>Precision = TP/(TP+FP)</p>
</list-item>
<list-item>
<p>Recall = TP/(TP+FN)</p>
</list-item>
<list-item>
<p>F-score = 2*Precision*Recall/(Precision+Recall)</p>
</list-item>
</list>
</p>
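<p>The four indicators above translate directly into code. In the Python sketch below, the function name classification_metrics is hypothetical, the counts in the usage line are made-up illustrative values, and returning zero when a denominator is zero is an assumption of the sketch.</p>
<preformat>
```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-score from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Illustrative counts only, not results from the experiments.
m = classification_metrics(tp=8, fn=2, fp=1, tn=9)
print(m)
```
</preformat>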
</sec>
</sec>
<sec id="Sec13" sec-type="results">
<title>Results</title>
<p>A new alignment-free sequence similarity analysis method, SSAW, is proposed. The performance of SSAW is compared against that of two methods, namely, WFV [
<xref ref-type="bibr" rid="CR41">41</xref>
] and
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M47">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M48">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
[
<xref ref-type="bibr" rid="CR18">18</xref>
], which represent the current state-of-the-art. Compared with WFV and
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M49">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M50">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
, the SSAW method demonstrates competitive performance in clustering and classification, with respect to both effectiveness (accuracy), and efficiency (running time).</p>
<sec id="Sec14">
<title>Datasets</title>
<p>Three types of data are used in our experimental evaluation, namely, DNA sequences, protein sequences, and simulated next generation sequences. The DNA datasets are the same as those used in Bao et al.’s original paper [
<xref ref-type="bibr" rid="CR41">41</xref>
]. The longest sequence has 8748 characters and the shortest sequence has 186 characters. The HOG datasets used contained 100, 200, and 300 families, with corresponding family sizes of 96, 113, and 93 DNA sequences, respectively.</p>
<p>The protein datasets were also obtained from [
<xref ref-type="bibr" rid="CR41">41</xref>
]; we randomly selected them from HOGENOM. They are likewise drawn from HOG100, HOG200, and HOG300. The longest sequence has 2197 characters and the shortest sequence has 35 characters. The HOG protein datasets contained 100, 200, and 300 families, with average family sizes of 9, 10, and 11, respectively. Both protein and DNA datasets were collected by the Institute of Biology and Chemistry of Proteins (IBCP), using PBIL (population-based incremental learning), and are available at:
<ext-link ext-link-type="uri" xlink:href="ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/">ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/</ext-link>
.</p>
<p>The third dataset is our simulated DNA next-generation sequencing data, with a total of 520 sequences of length 47 base pairs each. There are eight classes, each with 65 sequences. The original 8 sequences are randomly selected from a next-generation sequencing dataset (Illumina platform) for error correction [
<xref ref-type="bibr" rid="CR56">56</xref>
]. During simulation, 8 sequences of length 47 with edit distance of 10 among them are randomly selected. These 8 sequences are regarded as the 8 data centroids. For each centroid, 64 sequences are generated with edit distance ≤ 4 from the centroid. These 8 centroids form our 8 cluster centers.</p>
</sec>
<sec id="Sec15">
<title>Experimental design</title>
<p>The experiments were performed on a machine running Windows 7 Operating System (64 bit professional edition) with Intel Core i5-3470 (3.20 GHz) CPU and 8 GB RAM. The experiments were performed on the three types of data described, and their corresponding run times (in seconds) were also recorded. The reported execution times are averages over several iterations.</p>
<p>Firstly, we check the validity of the proposed SSAW by comparing it against the standard edit distance [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
] and the global alignment identity score [
<xref ref-type="bibr" rid="CR5">5</xref>
]. The edit distance between two strings is defined as the minimum number of edit operations required to transform one string into the other. The edit distance is the basic standard used to compare two strings [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
]. The Needleman-Wunsch alignment algorithm is the other gold standard in measuring sequence similarity [
<xref ref-type="bibr" rid="CR57">57</xref>
]. Both are computed using dynamic programming and have quadratic time complexity with respect to the length of the strings [
<xref ref-type="bibr" rid="CR58">58</xref>
]. Because of this cost, we randomly extract 100 sequences from the dataset for this validity check.</p>
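<p>For reference, the edit distance defined above can be computed with the classic dynamic programming recurrence, sketched here in Python. The function name edit_distance is hypothetical, and unit costs for insertion, deletion, and substitution are assumed. The two nested loops make the quadratic cost explicit.</p>
<preformat>
```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


print(edit_distance("AACAA", "CCGCC"))  # 5
```
</preformat>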
<p>For clustering,
<italic>k</italic>
-means [
<xref ref-type="bibr" rid="CR59">59</xref>
] in RGui is used. The proposed SSAW, WFV by Bao et al. [
<xref ref-type="bibr" rid="CR41">41</xref>
], and
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M51">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M52">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
by Lin et al. [
<xref ref-type="bibr" rid="CR18">18</xref>
] are assessed by using F-score, precision, and recall. It is well known that, for
<italic>k</italic>
-means, the choice of initial centers is important. To diminish the influence of the initial centers, the cluster centers are selected randomly, and the experiment is repeated 200 times. The average value is then reported.</p>
<p>For the classification experiment, we used the
<italic>1-NN</italic>
classification algorithm (
<italic>kNN</italic>
method with
<italic>k</italic>
=1). To reduce the random selection effect caused by dividing the data into training and testing sets, the classification experiment is repeated 100 times and the average is reported. Stratified sampling is applied to select 80 percent of the data for training, and the remaining 20 percent is used for testing.</p>
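<p>One repetition of this protocol might look like the Python sketch below, which assumes Python 3.8 or later for math.dist. The function names stratified_split and predict_1nn, the fixed seed, and the rounding used to size each class's training share are assumptions of this sketch and are not taken from the original experiments.</p>
<preformat>
```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def predict_1nn(train_x, train_y, query):
    """Label of the training vector closest to `query` (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]
```
</preformat>
<p>Repeating the split with different seeds and averaging the resulting scores would mirror the repeated-trial design described above.</p>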
<p>The SSAW method has two parameters that need to be set, namely, the
<italic>k</italic>
value for
<italic>k</italic>
-mers, and the decomposition level
<italic>L</italic>
in the wavelet transformation stage. The value of
<italic>k</italic>
is determined by using Eq.
<xref rid="Equ5" ref-type="">5</xref>
, which is motivated by earlier work [
<xref ref-type="bibr" rid="CR18">18</xref>
,
<xref ref-type="bibr" rid="CR54">54</xref>
]. After running all possible decomposition levels, our experiments showed that setting
<italic>L</italic>
=
<italic>k</italic>
is the most suitable in our applications. Hence, in SSAW, the recommended parameter values for
<italic>k</italic>
and
<italic>L</italic>
can be automatically determined by using Eq.
<xref rid="Equ5" ref-type="">5</xref>
. For WFV, the vector length is fixed at 32, as recommended by the original authors [
<xref ref-type="bibr" rid="CR41">41</xref>
].</p>
<sec id="Sec16">
<title>Validity of the proposed SSAW</title>
<p>Two groups of correlation measures are calculated on two datasets, namely DNA sequence data and protein sequence data. One is the correlation between the edit distance and the respective results of the SSAW, WFV, and
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M53">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M54">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
methods. The other is the correlation between the global alignment identity score and the results of the SSAW, WFV, and
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M55">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M56">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo></mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
methods. The global alignment identity score is calculated by using the Needleman-Wunsch algorithm [
<xref ref-type="bibr" rid="CR57">57</xref>
]. One hundred sequences are randomly selected from one cluster of DNA sequences (and from one family of protein sequences). Then, the edit distance, the global alignment identity score, and the results for SSAW, WFV, and
<inline-formula id="IEq14">
<alternatives>
<tex-math id="M57">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M58">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq14.gif"></inline-graphic>
</alternatives>
</inline-formula>
are calculated between pairs of sequences. Finally, the Pearson correlation coefficient is calculated between the edit distance and the respective results from the three methods. The same correlation is repeated using the global alignment identity score, rather than the edit distance. The correlation results are shown in Table 
<xref rid="Tab3" ref-type="table">3</xref>
.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Correlations between the edit distance (or the global alignment identity score) and the three methods</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left" colspan="3">DNA</th>
<th align="left" colspan="3">Protein</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left">SSAW</td>
<td align="left">WFV</td>
<td align="left">
<inline-formula id="IEq15">
<alternatives>
<tex-math id="M59">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M60">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq15.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">SSAW</td>
<td align="left">WFV</td>
<td align="left">
<inline-formula id="IEq16">
<alternatives>
<tex-math id="M61">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M62">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq16.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">
<italic>Edit distance</italic>
</td>
<td align="left">0.779</td>
<td align="left">0.837</td>
<td align="left">-0.67</td>
<td align="left">0.852</td>
<td align="left">0.861</td>
<td align="left">-0.842</td>
</tr>
<tr>
<td align="left">
<italic>Identity score</italic>
</td>
<td align="left">-0.741</td>
<td align="left">-0.742</td>
<td align="left">0.799</td>
<td align="left">-0.841</td>
<td align="left">-0.822</td>
<td align="left">0.789</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
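<p>The correlation analysis described above can be sketched as follows. The code is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and the argument <italic>method</italic> is a stand-in for any pairwise measure such as SSAW or WFV, none of which is implemented here.</p>
<preformat><![CDATA[
```python
from itertools import combinations
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_with_edit_distance(sequences, method):
    """Correlate method(a, b) with the edit distance over all sequence pairs."""
    pairs = list(combinations(sequences, 2))
    reference = [edit_distance(a, b) for a, b in pairs]
    scores = [method(a, b) for a, b in pairs]
    return pearson(reference, scores)
```
]]></preformat>
<p>With 100 sequences, the correlation is computed over 100 × 99 / 2 = 4950 unordered pairs, assuming every pair is used.</p>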
<p>Looking at Table 
<xref rid="Tab3" ref-type="table">3</xref>
, one may wonder why some correlations are negative while others are positive. The reason is as follows. The edit distance, SSAW, and WFV are all distance measures. Thus, the correlation between any two of them is positive. The global alignment identity score and
<inline-formula id="IEq17">
<alternatives>
<tex-math id="M63">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M64">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
both measure the similarity between sequences. Thus, these two are positively correlated with each other and negatively correlated with the three distance measures.</p>
<p>With the Pearson correlation coefficient, a value of 0 indicates no linear correlation, a value of 1 indicates perfect positive correlation, and a value of −1 indicates perfect negative correlation. For a comparison method, a value close to 1 or −1 indicates that the method is able to measure the similarity (or dissimilarity) between sequences. By contrast, a value close to 0 shows an inability to measure the similarity (or dissimilarity) between the given sequences.</p>
<p>To judge the strength of a Pearson correlation, we should consider its absolute value rather than its signed value. With this in mind, Table 
<xref rid="Tab3" ref-type="table">3</xref>
shows that all three methods are strongly correlated with the edit distance, and also with the global alignment identity score. This indicates that all three methods are valid for measuring similarity between DNA (or protein) sequences.</p>
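<p>The sign argument can be checked with a small Python sketch. The toy numbers and the linear similarity below are hypothetical and serve only to show that turning a distance into a similarity flips the sign of the correlation without changing its magnitude; the last lines rank the DNA edit-distance row of Table 3 by absolute value.</p>
<preformat><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical pairwise distances and a second measure that grows with them.
distances = [1.0, 3.0, 4.0, 6.0]
other_distance = [2.0 * d + 1.0 for d in distances]
# Hypothetical similarity that shrinks linearly as the distance grows.
similarity = [10.0 - d for d in distances]

r_distance_vs_distance = pearson(distances, other_distance)   # close to +1
r_distance_vs_similarity = pearson(distances, similarity)     # close to -1

# DNA edit-distance row of Table 3, ranked by absolute correlation.
table3_dna_edit = {"SSAW": 0.779, "WFV": 0.837, "K2*": -0.67}
ranked = sorted(table3_dna_edit, key=lambda m: abs(table3_dna_edit[m]),
                reverse=True)
```
]]></preformat>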
</sec>
<sec id="Sec17">
<title>DNA data</title>
<p>Table 
<xref rid="Tab4" ref-type="table">4</xref>
shows the experimental results for clustering DNA sequences using the three methods: SSAW, WFV, and
<inline-formula id="IEq18">
<alternatives>
<tex-math id="M65">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M66">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The F-score is computed by combining the values for precision and recall. Hence, for brevity, we focus on the F-score comparison in the following, although the values for precision and recall are also listed for reference. From Table 
<xref rid="Tab4" ref-type="table">4</xref>
, we can see that SSAW has the best overall performance on all three DNA data sets.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Comparison of the clustering results on DNA datasets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">DNA-Data</th>
<th align="left">Model</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">0.6099</td>
<td align="left">0.5953</td>
<td align="left">0.6648</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">0.5724</td>
<td align="left">0.5569</td>
<td align="left">0.6227</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq19">
<alternatives>
<tex-math id="M67">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M68">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.5551</td>
<td align="left">0.5112</td>
<td align="left">0.6073</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">0.5982</td>
<td align="left">0.5841</td>
<td align="left">0.6508</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">0.5635</td>
<td align="left">0.5610</td>
<td align="left">0.6214</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq20">
<alternatives>
<tex-math id="M69">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M70">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.5788</td>
<td align="left">0.5364</td>
<td align="left">0.6285</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">0.5961</td>
<td align="left">0.5869</td>
<td align="left">0.6421</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">0.5359</td>
<td align="left">0.5434</td>
<td align="left">0.5800</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq21">
<alternatives>
<tex-math id="M71">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M72">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.5466</td>
<td align="left">0.5081</td>
<td align="left">0.5915</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
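<p>As a side note on how the F-score combines precision and recall, the sketch below uses the usual harmonic-mean definition, which is assumed here rather than taken from the paper. It also illustrates one possible reason why a reported average F-score need not equal the harmonic mean of the reported average precision and recall: if the F-score is computed per run and then averaged, the result is generally lower. The per-run values are hypothetical.</p>
<preformat><![CDATA[
```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs from three runs.
runs = [(0.9, 0.3), (0.3, 0.9), (0.6, 0.6)]

mean_precision = sum(p for p, _ in runs) / len(runs)
mean_recall = sum(r for _, r in runs) / len(runs)

# Averaging the per-run F-scores ...
mean_of_f = sum(f_score(p, r) for p, r in runs) / len(runs)
# ... differs from the F-score of the averaged precision and recall.
f_of_means = f_score(mean_precision, mean_recall)

# Harmonic mean of the SSAW precision and recall on HOG100 in Table 4;
# the table itself reports an F-score of 0.6099.
table4_ssaw_hog100 = f_score(0.5953, 0.6648)
```
]]></preformat>
<p>Under these assumptions, the averaged per-run F-score is the smaller of the two quantities, which matches the direction of the gap seen in Table 4.</p>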
<p>Table 
<xref rid="Tab5" ref-type="table">5</xref>
shows the classification results generated by the three models on DNA datasets. For classification, we additionally evaluate accuracy, which is known as a comprehensive indicator. Studying Table 
<xref rid="Tab5" ref-type="table">5</xref>
, the first impression is that the three models have very similar values. Using the accuracy measure, SSAW was slightly better on two datasets, HOG200 and HOG300, while
<inline-formula id="IEq22">
<alternatives>
<tex-math id="M73">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M74">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
was slightly better on HOG100. If we compare the F-score values, WFV was better on two datasets (HOG100 and HOG200), while SSAW was better on HOG300. In practice, we can say that these three models have similar performance, and that SSAW is competitive in this experiment.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Comparison of the classification results on DNA datasets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">DNA-Data</th>
<th align="left">Model</th>
<th align="left">Accuracy</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">0.9576</td>
<td align="left">0.9315</td>
<td align="left">0.9326</td>
<td align="left">0.9305</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">0.9574</td>
<td align="left">0.9426</td>
<td align="left">0.9475</td>
<td align="left">0.9447</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq23">
<alternatives>
<tex-math id="M75">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M76">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq23.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.9587</td>
<td align="left">0.9335</td>
<td align="left">0.9472</td>
<td align="left">0.9202</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">0.9548</td>
<td align="left">0.9256</td>
<td align="left">0.9366</td>
<td align="left">0.9149</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">0.9544</td>
<td align="left">0.9355</td>
<td align="left">0.9430</td>
<td align="left">0.9350</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq24">
<alternatives>
<tex-math id="M77">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M78">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq24.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.9439</td>
<td align="left">0.9320</td>
<td align="left">0.9331</td>
<td align="left">0.9309</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">0.9509</td>
<td align="left">0.9311</td>
<td align="left">0.9354</td>
<td align="left">0.9268</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">0.9402</td>
<td align="left">0.9208</td>
<td align="left">0.9286</td>
<td align="left">0.9219</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq25">
<alternatives>
<tex-math id="M79">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M80">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq25.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.9328</td>
<td align="left">0.9255</td>
<td align="left">0.9229</td>
<td align="left">0.9282</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Table 
<xref rid="Tab6" ref-type="table">6</xref>
shows the corresponding running times for the three analysis methods in clustering and classification on DNA datasets. From Table 
<xref rid="Tab6" ref-type="table">6</xref>
, we can observe that, for clustering, SSAW is the fastest of the three methods. It runs approximately 3, 5, and 10 times faster than WFV on HOG100, HOG200, and HOG300, respectively. For classification of DNA sequences, WFV was the fastest of the three methods.
<inline-formula id="IEq26">
<alternatives>
<tex-math id="M81">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M82">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq26.gif"></inline-graphic>
</alternatives>
</inline-formula>
was faster than SSAW on two of the three data sets (HOG100 and HOG200), but slower on HOG300.
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>Running time for clustering and classification on DNA datasets. The fold improvement of the proposed SSAW approach over a given method is listed in parentheses</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">DNA-Data</th>
<th align="left">Model</th>
<th align="left">Total</th>
<th align="left">Total</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left">clustering time</th>
<th align="left">classification time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">19.8000</td>
<td align="left">16.8159</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">55.4619(3)</td>
<td align="left">10.4614</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq27">
<alternatives>
<tex-math id="M83">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M84">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq27.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">39.676(2)</td>
<td align="left">11.3421</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">50.9515</td>
<td align="left">51.5956</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">238.5061(5)</td>
<td align="left">26.8309</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq28">
<alternatives>
<tex-math id="M85">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M86">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq28.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">104.327(2)</td>
<td align="left">37.8473</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">63.9960</td>
<td align="left">77.7017</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">640.1409(10)</td>
<td align="left">31.4625</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq29">
<alternatives>
<tex-math id="M87">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M88">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq29.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">238.712(4)</td>
<td align="left">94.8274</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
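<p>The fold values in parentheses in Table 6 can be reproduced from the clustering times, as the following sketch shows. It assumes that each fold value is the ratio of a method's time to the SSAW time, rounded to the nearest integer; the rounding rule and the function name are assumptions that happen to match the table.</p>
<preformat><![CDATA[
```python
# Total clustering times copied from Table 6 (units as reported in the table).
clustering_time = {
    "HOG100": {"SSAW": 19.8000, "WFV": 55.4619, "K2*": 39.676},
    "HOG200": {"SSAW": 50.9515, "WFV": 238.5061, "K2*": 104.327},
    "HOG300": {"SSAW": 63.9960, "WFV": 640.1409, "K2*": 238.712},
}


def fold_over_ssaw(times):
    """Ratio of every other method's time to the SSAW time, rounded."""
    return {method: round(t / times["SSAW"])
            for method, t in times.items() if method != "SSAW"}


folds = {dataset: fold_over_ssaw(times)
         for dataset, times in clustering_time.items()}
```
]]></preformat>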
<p>Combining the performance of these three models, we can note the following: (1) For clustering, the recommended method is SSAW, which has both the best performance and the fastest running time. (2) For classification, WFV would be the best choice, as it combines good performance with the shortest running time. However, SSAW demonstrated competitive performance with respect to both accuracy and running time.</p>
</sec>
<sec id="Sec18">
<title>Protein data</title>
<p>Table 
<xref rid="Tab7" ref-type="table">7</xref>
shows the clustering results on the protein sequence data. In all three data subsets, SSAW was the best.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>Comparison of the clustering results on protein datasets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Protein-Data</th>
<th align="left">Model</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">0.7651</td>
<td align="left">0.7497</td>
<td align="left">0.8001</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">0.5874</td>
<td align="left">0.5687</td>
<td align="left">0.6382</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq30">
<alternatives>
<tex-math id="M89">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M90">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq30.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.6604</td>
<td align="left">0.642</td>
<td align="left">0.6798</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">0.7746</td>
<td align="left">0.7573</td>
<td align="left">0.8103</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">0.6410</td>
<td align="left">0.6195</td>
<td align="left">0.6913</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq31">
<alternatives>
<tex-math id="M91">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M92">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq31.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.6435</td>
<td align="left">0.5969</td>
<td align="left">0.6979</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">0.7246</td>
<td align="left">0.7088</td>
<td align="left">0.7653</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">0.5016</td>
<td align="left">0.4826</td>
<td align="left">0.5551</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq32">
<alternatives>
<tex-math id="M93">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M94">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq32.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.6429</td>
<td align="left">0.6111</td>
<td align="left">0.6782</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Table 
<xref rid="Tab8" ref-type="table">8</xref>
shows the classification results generated using these three methods on protein data sets. Using accuracy for performance measurement, SSAW was the best on two data sets (HOG200 and HOG300), while
<inline-formula id="IEq33">
<alternatives>
<tex-math id="M95">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M96">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq33.gif"></inline-graphic>
</alternatives>
</inline-formula>
performed best on the remaining data set (HOG100). Using F-score, SSAW was best on HOG300 and
<inline-formula id="IEq34">
<alternatives>
<tex-math id="M97">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M98">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq34.gif"></inline-graphic>
</alternatives>
</inline-formula>
was the best on the other two data subsets. Generally speaking, SSAW and
<inline-formula id="IEq35">
<alternatives>
<tex-math id="M99">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M100">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq35.gif"></inline-graphic>
</alternatives>
</inline-formula>
were quite competitive in this experiment, while WFV generated inferior results. Table 
<xref rid="Tab9" ref-type="table">9</xref>
shows the running time for clustering and classification on protein datasets. On all protein data sets and in both applications, SSAW outperformed the other two methods by a wide margin. WFV was the runner-up, while
<inline-formula id="IEq36">
<alternatives>
<tex-math id="M101">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M102">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq36.gif"></inline-graphic>
</alternatives>
</inline-formula>
could not compete on this dataset.
<table-wrap id="Tab8">
<label>Table 8</label>
<caption>
<p>Comparison of the classification results on protein data</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Data</th>
<th align="left">Model</th>
<th align="left">Accuracy</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">0.8158</td>
<td align="left">0.6274</td>
<td align="left">0.6225</td>
<td align="left">0.6644</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">0.6741</td>
<td align="left">0.5092</td>
<td align="left">0.5012</td>
<td align="left">0.5518</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq37">
<alternatives>
<tex-math id="M103">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M104">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq37.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.8329</td>
<td align="left">0.6540</td>
<td align="left">0.6248</td>
<td align="left">0.6861</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">0.8222</td>
<td align="left">0.5626</td>
<td align="left">0.5441</td>
<td align="left">0.6174</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">0.7051</td>
<td align="left">0.4454</td>
<td align="left">0.4359</td>
<td align="left">0.4902</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq38">
<alternatives>
<tex-math id="M105">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M106">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq38.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.8061</td>
<td align="left">0.6279</td>
<td align="left">0.5875</td>
<td align="left">0.6743</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">0.8690</td>
<td align="left">0.7345</td>
<td align="left">0.7466</td>
<td align="left">0.7642</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">0.5685</td>
<td align="left">0.3468</td>
<td align="left">0.3551</td>
<td align="left">0.3774</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq39">
<alternatives>
<tex-math id="M107">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M108">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq39.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.8098</td>
<td align="left">0.6308</td>
<td align="left">0.5983</td>
<td align="left">0.6670</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab9">
<label>Table 9</label>
<caption>
<p>Running time for clustering and classification on protein datasets. The fold improvement of the proposed SSAW over a given method is listed in parentheses</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Protein data</th>
<th align="left">Model</th>
<th align="left">Total clustering time</th>
<th align="left">Total classification time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HOG100</td>
<td align="left">SSAW</td>
<td align="left">0.1638</td>
<td align="left">0.1262</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">WFV</td>
<td align="left">5.5554(34)</td>
<td align="left">0.4164(3)</td>
</tr>
<tr>
<td align="left">HOG100</td>
<td align="left">
<inline-formula id="IEq40">
<alternatives>
<tex-math id="M109">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M110">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq40.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">10.964(67)</td>
<td align="left">1.3780(11)</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">SSAW</td>
<td align="left">0.3542</td>
<td align="left">0.2738</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">WFV</td>
<td align="left">11.5037(32)</td>
<td align="left">0.9362(3)</td>
</tr>
<tr>
<td align="left">HOG200</td>
<td align="left">
<inline-formula id="IEq41">
<alternatives>
<tex-math id="M111">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M112">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq41.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">49.016(138)</td>
<td align="left">3.091(11)</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">SSAW</td>
<td align="left">0.6965</td>
<td align="left">0.5077</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">WFV</td>
<td align="left">27.2514(39)</td>
<td align="left">1.7460(3)</td>
</tr>
<tr>
<td align="left">HOG300</td>
<td align="left">
<inline-formula id="IEq42">
<alternatives>
<tex-math id="M113">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M114">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq42.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">126.984(182)</td>
<td align="left">5.284(10)</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
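<p>As a rough check of the parenthesised values in Table 9, the fold improvement can be reproduced by dividing a method's running time by that of SSAW. The sketch below does this for the HOG100 rows only. The function names and the round-to-nearest-integer convention are assumptions made for illustration; the paper does not state how the values were rounded.</p>
<preformat>
```python
# Running times copied from the HOG100 rows of Table 9 (units as reported there).
SSAW_TIMES = {"clustering": 0.1638, "classification": 0.1262}
OTHER_TIMES = {
    "WFV": {"clustering": 5.5554, "classification": 0.4164},
    "K2*": {"clustering": 10.964, "classification": 1.3780},
}


def fold_improvement(other_time, ssaw_time):
    """Return how many times longer the other method ran than SSAW.

    Rounding to the nearest integer is an assumption about how the
    parenthesised values in the table were produced.
    """
    return round(other_time / ssaw_time)


def fold_table():
    """Build {method: {task: fold}} for every method and task listed above."""
    return {
        method: {
            task: fold_improvement(t, SSAW_TIMES[task])
            for task, t in times.items()
        }
        for method, times in OTHER_TIMES.items()
    }


if __name__ == "__main__":
    for method, folds in fold_table().items():
        print(method, folds)
```
</preformat>
<p>Under this convention the HOG100 ratios come out as 34 and 3 for WFV and 67 and 11 for the pairwise method, matching the table.</p>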
<p>Taken together, the protein results support several observations: (1) SSAW generally performs best in clustering and classification on the protein datasets. (2) SSAW also has the shortest running time. (3) The
<inline-formula id="IEq43">
<alternatives>
<tex-math id="M115">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M116">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq43.gif"></inline-graphic>
</alternatives>
</inline-formula>
was better than WFV in some cases; however, its execution time was higher than that of WFV. (4) For WFV, the running time was second to SSAW; however, the accuracy was not as good. Overall, it appears that, as the alphabet size increases, the proposed SSAW method, with its initial stage of mapping the
<italic>k</italic>
-mers to complex numbers based on the unit circle, produces results superior to the state of the art.</p>
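<p>The idea of mapping k-mers to complex numbers on the unit circle can be sketched as follows. This is an illustration only: the lexicographic enumeration of k-mers, the even angular spacing, and the function names are assumptions, and the exact mapping used by SSAW is the one defined in the Methods section, not this sketch.</p>
<preformat>
```python
import cmath
from itertools import product


def kmer_unit_circle_map(alphabet, k):
    """Map every k-mer over `alphabet` to a distinct point on the unit circle.

    Assumption: k-mers are enumerated in lexicographic order and spaced
    evenly, so the j-th of N k-mers is sent to exp(2*pi*i*j/N).
    """
    kmers = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    n = len(kmers)
    return {
        kmer: cmath.exp(2j * cmath.pi * j / n)
        for j, kmer in enumerate(kmers)
    }


def encode(sequence, mapping, k):
    """Slide a window of length k over `sequence` and look up each k-mer."""
    return [mapping[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    dna_map = kmer_unit_circle_map("ACGT", 2)
    signal = encode("ACGTAC", dna_map, 2)
    print(len(dna_map), len(signal))
```
</preformat>
<p>With a 20-letter protein alphabet the same construction yields 400 points for k = 2, which illustrates how the number of mapped symbols grows with alphabet size.</p>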
</sec>
<sec id="Sec19">
<title>Simulated data</title>
<p>Table 
<xref rid="Tab10" ref-type="table">10</xref>
shows the results for clustering using the simulated datasets. We can see from Table 
<xref rid="Tab10" ref-type="table">10</xref>
that
<inline-formula id="IEq44">
<alternatives>
<tex-math id="M117">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M118">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq44.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the best of the three methods. Comparing SSAW with WFV, WFV is slightly better in F-score and recall, while SSAW has slightly higher precision; their performance numbers are quite close.
<table-wrap id="Tab10">
<label>Table 10</label>
<caption>
<p>Comparison of the clustering results on simulated dataset</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Model</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAW</td>
<td align="left">0.8151</td>
<td align="left">0.8085</td>
<td align="left">0.8467</td>
</tr>
<tr>
<td align="left">WFV</td>
<td align="left">0.8211</td>
<td align="left">0.8056</td>
<td align="left">0.8587</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq45">
<alternatives>
<tex-math id="M119">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M120">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq45.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.8584</td>
<td align="left">0.8750</td>
<td align="left">0.8425</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Table 
<xref rid="Tab11" ref-type="table">11</xref>
compares the classification results of the three methods on the simulated data. WFV is the best of the three; SSAW is second, performing better than
<inline-formula id="IEq46">
<alternatives>
<tex-math id="M121">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M122">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq46.gif"></inline-graphic>
</alternatives>
</inline-formula>
.
<table-wrap id="Tab11">
<label>Table 11</label>
<caption>
<p>Comparison of the classification results on simulated data</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Model</th>
<th align="left">Accuracy</th>
<th align="left">F-score</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAW</td>
<td align="left">0.9789</td>
<td align="left">0.9789</td>
<td align="left">0.9804</td>
<td align="left">0.9789</td>
</tr>
<tr>
<td align="left">WFV</td>
<td align="left">0.9992</td>
<td align="left">0.9992</td>
<td align="left">0.9993</td>
<td align="left">0.9992</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq47">
<alternatives>
<tex-math id="M123">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M124">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq47.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">0.9607</td>
<td align="left">0.9662</td>
<td align="left">0.9696</td>
<td align="left">0.9628</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Table 
<xref rid="Tab12" ref-type="table">12</xref>
reports the running times of the three methods on simulated data. Of the three, SSAW was the fastest.
<inline-formula id="IEq48">
<alternatives>
<tex-math id="M125">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M126">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq48.gif"></inline-graphic>
</alternatives>
</inline-formula>
was the slowest in clustering. For clustering, the running times of
<inline-formula id="IEq49">
<alternatives>
<tex-math id="M127">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M128">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq49.gif"></inline-graphic>
</alternatives>
</inline-formula>
and WFV were, respectively, 18 and 15 times that of SSAW. In classification, the running times of
<inline-formula id="IEq50">
<alternatives>
<tex-math id="M129">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M130">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq50.gif"></inline-graphic>
</alternatives>
</inline-formula>
and WFV were 2 and 11 times that of SSAW, respectively.
<table-wrap id="Tab12">
<label>Table 12</label>
<caption>
<p>Running time of the three methods for clustering and classification on simulated data. The fold improvement of the proposed SSAW over a given method is listed in parentheses</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Model</th>
<th align="left">Total clustering time</th>
<th align="left">Total classification time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SSAW</td>
<td align="left">0.0632</td>
<td align="left">0.0810</td>
</tr>
<tr>
<td align="left">WFV</td>
<td align="left">0.9288(15)</td>
<td align="left">0.9313(11)</td>
</tr>
<tr>
<td align="left">
<inline-formula id="IEq51">
<alternatives>
<tex-math id="M131">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M132">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq51.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td align="left">1.123(18)</td>
<td align="left">0.172(2)</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Combining performance and speed, we note the following for the simulated data: (1) SSAW and WFV are the recommended methods for clustering. The running time of
<inline-formula id="IEq52">
<alternatives>
<tex-math id="M133">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M134">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq52.gif"></inline-graphic>
</alternatives>
</inline-formula>
is relatively high: 18 times that of SSAW and 1.2 times that of WFV. (2) For classification, SSAW is a good choice, with competitive performance and the shortest running time. WFV is the most accurate method; however, it has a longer running time (11 times that of SSAW, and 5.4 times that of
<inline-formula id="IEq53">
<alternatives>
<tex-math id="M135">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M136">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq53.gif"></inline-graphic>
</alternatives>
</inline-formula>
).</p>
<p>Considering the three types of data used in the experiments, and the two applications considered, we can draw some overall conclusions. Table 
<xref rid="Tab13" ref-type="table">13</xref>
summarizes the overall results of our analysis.
<table-wrap id="Tab13">
<label>Table 13</label>
<caption>
<p>Recommended methods for clustering and classification on the three types of data. A model in parentheses is competitive</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Data</th>
<th align="left">Clustering</th>
<th align="left">Classification</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">DNA</td>
<td align="left">SSAW</td>
<td align="left">WFV(SSAW)</td>
</tr>
<tr>
<td align="left">Protein</td>
<td align="left">SSAW</td>
<td align="left">SSAW</td>
</tr>
<tr>
<td align="left">Simulated</td>
<td align="left">SSAW(WFV)</td>
<td align="left">SSAW</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
</sec>
<sec id="Sec20" sec-type="discussion">
<title>Discussion</title>
<p>The proposed SSAW is inspired by the WFV method reported in [
<xref ref-type="bibr" rid="CR41">41</xref>
]. In Bao et al.’s work [
<xref ref-type="bibr" rid="CR41">41</xref>
], WFV was compared to five state-of-the-art methods, namely,
<italic>k</italic>
-tuple [
<xref ref-type="bibr" rid="CR4">4</xref>
,
<xref ref-type="bibr" rid="CR30">30</xref>
], DMK [
<xref ref-type="bibr" rid="CR31">31</xref>
], TSM [
<xref ref-type="bibr" rid="CR36">36</xref>
], AMI [
<xref ref-type="bibr" rid="CR29">29</xref>
] and CV [
<xref ref-type="bibr" rid="CR32">32</xref>
] on a DNA dataset. WFV demonstrated overwhelming superiority over each of these methods. Because the proposed SSAW is better than or comparable to WFV in clustering on each of the three types of data considered, we can expect that SSAW will have competitive (if not better) performance, with respect to both accuracy and speed, when compared against these five state-of-the-art methods. Classification performance was not examined in the original work of Bao et al. [
<xref ref-type="bibr" rid="CR41">41</xref>
].</p>
<p>Similarly, in [
<xref ref-type="bibr" rid="CR18">18</xref>
], the
<inline-formula id="IEq54">
<alternatives>
<tex-math id="M137">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M138">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq54.gif"></inline-graphic>
</alternatives>
</inline-formula>
method was compared with more than nine other alignment-free algorithms, especially those that consider sequences in a pairwise manner (such as the general
<italic>D</italic>
<sub>2</sub>
-family). The
<inline-formula id="IEq55">
<alternatives>
<tex-math id="M139">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M140">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq55.gif"></inline-graphic>
</alternatives>
</inline-formula>
was shown to outperform most of the methods in this category. Thus, we expect that the relative performance of the proposed SSAW method over
<inline-formula id="IEq56">
<alternatives>
<tex-math id="M141">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M142">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq56.gif"></inline-graphic>
</alternatives>
</inline-formula>
gives an indication of how it will perform when compared with the
<italic>D</italic>
<sub>2</sub>
-family, and other methods investigated in [
<xref ref-type="bibr" rid="CR18">18</xref>
].</p>
<p>SSAW generally outperformed WFV with respect to accuracy and F-score. The performance improvement of SSAW over WFV can be attributed to two key factors: (1) the use of the stationary discrete wavelet transform, which omits the downsampling step and therefore retains more information during the transformation than the standard discrete wavelet transform used in [
<xref ref-type="bibr" rid="CR41">41</xref>
]; (2) the use of an improved representation for the
<italic>k</italic>
-mers, based on the initial mapping to complex numbers using the unit circle, before performing the wavelet transformation.</p>
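<p>The first factor can be illustrated with a single-level Haar transform. The sketch below contrasts a decimated transform, which halves the number of coefficients, with an undecimated (stationary) one, which keeps one coefficient per input sample. The choice of the Haar wavelet, the periodic boundary handling, and the function names are assumptions made for illustration; they are not taken from the SSAW implementation.</p>
<preformat>
```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """Single-level decimated Haar transform.

    Samples are combined in non-overlapping pairs, so each output
    list has len(x) // 2 coefficients.
    """
    pairs = [(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail


def haar_swt(x):
    """Single-level undecimated (stationary) Haar transform.

    The filter is applied at every position with periodic wrap-around
    and no downsampling, so each output list keeps len(x) coefficients.
    """
    n = len(x)
    approx = [(x[i] + x[(i + 1) % n]) / SQRT2 for i in range(n)]
    detail = [(x[i] - x[(i + 1) % n]) / SQRT2 for i in range(n)]
    return approx, detail


if __name__ == "__main__":
    signal = [1.0, 3.0, 2.0, 2.0, 5.0, 1.0, 0.0, 4.0]
    print(len(haar_dwt(signal)[0]), len(haar_swt(signal)[0]))
```
</preformat>
<p>In this sketch the decimated coefficients are exactly the even-indexed coefficients of the stationary transform, so the stationary output contains the decimated one plus the values at the odd positions that decimation discards.</p>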
<p>For clustering, SSAW outperformed
<inline-formula id="IEq57">
<alternatives>
<tex-math id="M143">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M144">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq57.gif"></inline-graphic>
</alternatives>
</inline-formula>
. This could be due to several reasons, including the two factors mentioned above. Further, while
<inline-formula id="IEq58">
<alternatives>
<tex-math id="M145">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M146">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq58.gif"></inline-graphic>
</alternatives>
</inline-formula>
needs to compare sequences pair by pair, SSAW and WFV do not. Rather, they generate a series of numbers to represent all sequences together, which are then transformed into a feature vector. Hence, these two wavelet-based methods are more suitable for clustering than
<inline-formula id="IEq59">
<alternatives>
<tex-math id="M147">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M148">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq59.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<p>In classification of DNA sequences, SSAW produced better results than WFV for short sequences (less than 1000 bp). SSAW was slower in DNA classification on relatively longer sequences (i.e., DNA data with an average sequence length of 1495 bp). It appears that SSAW is not well suited to long sequences from a small alphabet. However, for larger alphabets, such as protein sequences (with an average sequence length of 497 amino acids), SSAW showed superior performance over both WFV and
<inline-formula id="IEq60">
<alternatives>
<tex-math id="M149">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M150">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq60.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<p>SSAW did not perform well in generating the phylogenetic tree and in evaluating functionally related regulatory sequences. This is not too surprising, given the observed performance of WFV on these problems (see [
<xref ref-type="bibr" rid="CR18">18</xref>
] for comparison with
<inline-formula id="IEq61">
<alternatives>
<tex-math id="M151">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$K_{2}^{*}$\end{document}</tex-math>
<mml:math id="M152">
<mml:msubsup>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="12859_2018_2155_Article_IEq61.gif"></inline-graphic>
</alternatives>
</inline-formula>
).</p>
<p>The distance measure used in SSAW is the simple Euclidean distance between two vectors. Luczak et al. [
<xref ref-type="bibr" rid="CR5">5</xref>
] provided a recent comprehensive survey of different statistics for evaluating sequence similarity in alignment-free methods. After studying over 30 statistics (more than 10 basic measurements and their combinations), Luczak et al. [
<xref ref-type="bibr" rid="CR5">5</xref>
] showed that simple single statistics are sufficient in alignment-free
<italic>k</italic>
-mer based similarity measurement. The Euclidean distance used in this work is thus just one possible choice of distance measure. Other distance measures, such as the earth mover's distance, can be considered to further improve the proposed SSAW approach. Similarly, classification and clustering were performed using simple algorithms. Further improvement may be realized with more sophisticated analysis methods, e.g., using random forests for classification.</p>
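<p>For concreteness, the Euclidean distance between feature vectors can be sketched as below. The function names and the example vectors are hypothetical and serve only to show the computation; any other distance measure could be substituted by replacing the first function.</p>
<preformat>
```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances."""
    return [[euclidean_distance(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Made-up feature vectors, one per sequence.
    features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
    for row in distance_matrix(features):
        print(row)
```
</preformat>
<p>A matrix of this form is the usual input to distance-based clustering or nearest-neighbour classification.</p>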
<p>One of the main advantages of SSAW is the running time. SSAW is much faster than the other two methods, showing orders of magnitude improvement in execution time, while maintaining competitive (if not better) accuracy. Considering the huge volumes of data involved in most modern applications, and the rate at which these datasets are being generated, the rapid processing speed of alignment-free methods becomes a key factor. The proposed SSAW provides very rapid processing, without an undue loss in accuracy. This makes SSAW an attractive approach in most practical scenarios.</p>
</sec>
<sec id="Sec21" sec-type="conclusion">
<title>Conclusions</title>
<p>A new alignment-free model for similarity assessment is proposed. We call it SSAW: Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform. Three types of data are used in the study: DNA sequences, protein sequences, and simulated next-generation sequences. Two applications, clustering and classification, are considered. Compared with the state-of-the-art methods WFV and
<italic>K</italic>
<sub>2</sub>
∗, the proposed SSAW demonstrated competitive performance (accuracy, F-score, precision, and recall) in both clustering and classification. It also exhibited shorter running times than the other methods. These properties make SSAW a practical approach to rapid sequence analysis, suitable for the rapidly increasing volumes of sequence data involved in most modern biological applications.</p>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>AMI</term>
<def>
<p>Average mutual information model which is proposed in paper [
<xref ref-type="bibr" rid="CR29">29</xref>
]</p>
</def>
</def-item>
<def-item>
<term>CPU</term>
<def>
<p>Central processing unit</p>
</def>
</def-item>
<def-item>
<term>CV</term>
<def>
<p>A method which is proposed in paper [
<xref ref-type="bibr" rid="CR32">32</xref>
]</p>
</def>
</def-item>
<def-item>
<term>CWT</term>
<def>
<p>Continuous wavelet transform</p>
</def>
</def-item>
<def-item>
<term>DNA</term>
<def>
<p>Deoxyribonucleic acid</p>
</def>
</def-item>
<def-item>
<term>DMK</term>
<def>
<p>Distance measure based on
<italic>k</italic>
-tuples model which is proposed in paper [
<xref ref-type="bibr" rid="CR31">31</xref>
]</p>
</def>
</def-item>
<def-item>
<term>DWT</term>
<def>
<p>Discrete wavelet transform</p>
</def>
</def-item>
<def-item>
<term>FFT</term>
<def>
<p>Fast Fourier transform</p>
</def>
</def-item>
<def-item>
<term>FN</term>
<def>
<p>False negative</p>
</def>
</def-item>
<def-item>
<term>FP</term>
<def>
<p>False positive</p>
</def>
</def-item>
<def-item>
<term>GHz</term>
<def>
<p>Gigahertz</p>
</def>
</def-item>
<def-item>
<term>GB</term>
<def>
<p>Gigabyte</p>
</def>
</def-item>
<def-item>
<term>MATLAB</term>
<def>
<p>A software package developed by MathWorks Inc., Natick, MA, USA,
<ext-link ext-link-type="uri" xlink:href="https://www.mathworks.com/">https://www.mathworks.com/</ext-link>
</p>
</def>
</def-item>
<def-item>
<term>MRF</term>
<def>
<p>Markov random field</p>
</def>
</def-item>
<def-item>
<term>PBIL</term>
<def>
<p>PBIL is the abbreviation of PRABI-Lyon-Gerland. It is a protein database created in January 1998 and located at the Institute of Biology and Chemistry of Proteins (IBCP),
<ext-link ext-link-type="uri" xlink:href="ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/">ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/</ext-link>
</p>
</def>
</def-item>
<def-item>
<term>RAM</term>
<def>
<p>Random access memory</p>
</def>
</def-item>
<def-item>
<term>SBARS</term>
<def>
<p>Spectral-based approach for repeats search, proposed in [
<xref ref-type="bibr" rid="CR42">42</xref>
]</p>
</def>
</def-item>
<def-item>
<term>SSAW</term>
<def>
<p>Sequence Similarity Analysis method based on the stationary discrete Wavelet transform</p>
</def>
</def-item>
<def-item>
<term>SWT</term>
<def>
<p>Stationary wavelet transform</p>
</def>
</def-item>
<def-item>
<term>TN</term>
<def>
<p>True negative</p>
</def>
</def-item>
<def-item>
<term>TP</term>
<def>
<p>True positive</p>
</def>
</def-item>
<def-item>
<term>TSM</term>
<def>
<p>Three symbolic sequences model, proposed in [
<xref ref-type="bibr" rid="CR36">36</xref>
]</p>
</def>
</def-item>
<def-item>
<term>WFV</term>
<def>
<p>Wavelet-based feature vector model, proposed in [
<xref ref-type="bibr" rid="CR41">41</xref>
]</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>JL and YJ contributed the idea and designed the study. JW implemented and performed most of the experiments. JL, JW, DA, BHJ, and YJ wrote the manuscript. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors would like to thank Professor Bao, who provided the data and the source code of the paper [
<xref ref-type="bibr" rid="CR41">41</xref>
]. The authors would also like to thank the anonymous reviewers whose comments and suggestions have led to a significant improvement of this manuscript.</p>
<sec id="d29e6123">
<title>Funding</title>
<p>This work is supported in part by the Chinese National Natural Science Foundation (Grant No. 61472082), the Natural Science Foundation of Fujian Province of China (Grant No. 2014J01220), the Scientific Research Innovation Team Construction Program of Fujian Normal University (Grant No. IRTL1702), and the US National Science Foundation (Grant No. IIS-1552860).</p>
</sec>
<sec id="d29e6128">
<title>Availability of data and materials</title>
<p>The program code and data used are available at:
<ext-link ext-link-type="uri" xlink:href="http://community.wvu.edu/~daadjeroh/projects/SSAW/SSAWcodes.rar">http://community.wvu.edu/~daadjeroh/projects/SSAW/SSAWcodes.rar</ext-link>
</p>
<p>The DNA dataset comes from the article, A wavelet-based feature vector model for DNA clustering [
<xref ref-type="bibr" rid="CR41">41</xref>
], and was provided by the author of that paper, Dr. Bao. The protein dataset is the homologous dataset downloaded from PBIL:
<ext-link ext-link-type="uri" xlink:href="ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/">ftp://pbil.univ-lyon1.fr/pub/hogenom/release_06/</ext-link>
</p>
</sec>
</ack>
<notes notes-type="COI-statement">
<sec id="d29e6147">
<title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec id="d29e6152">
<title>Consent for publication</title>
<p>All authors consent to this publication.</p>
</sec>
<sec id="d29e6157">
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec id="d29e6162">
<title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<mixed-citation publication-type="other">Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. 1st ed. Cambridge University Press; 1997.</mixed-citation>
</ref>
<ref id="CR2">
<label>2</label>
<mixed-citation publication-type="other">Adjeroh D, Bell T, Mukherjee A. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. 1st ed. Springer Publishing Company; 2008.</mixed-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zielezinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Karlowski</surname>
<given-names>WM</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison: benefits, applications, and tools</article-title>
<source>Genome Biol</source>
<year>2017</year>
<volume>18</volume>
<issue>1</issue>
<fpage>186</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-017-1319-7</pub-id>
<pub-id pub-id-type="pmid">28974235</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison-a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<mixed-citation publication-type="other">Luczak BB, James BT, Girgis HZ. A survey and evaluations of histogram-based statistics in alignment-free sequence comparison. Briefings Bioinforma. 2017; online first bbx161.</mixed-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pratas</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Silva</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Pinho</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Ferreira</surname>
<given-names>PJSG</given-names>
</name>
</person-group>
<article-title>An alignment-free method to find and visualise rearrangements between pairs of DNA sequences</article-title>
<source>Sci Rep</source>
<year>2015</year>
<volume>5</volume>
<fpage>10203</fpage>
<pub-id pub-id-type="doi">10.1038/srep10203</pub-id>
<pub-id pub-id-type="pmid">25984837</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guillaume</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Roland</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Jens</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage</article-title>
<source>Algorithms Mol Biol</source>
<year>2016</year>
<volume>11</volume>
<issue>1</issue>
<fpage>3</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1186/s13015-016-0066-8</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pizzi</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics</article-title>
<source>Algorithms Mol Biol</source>
<year>2016</year>
<volume>11</volume>
<issue>6</issue>
<fpage>1</fpage>
<lpage>10</lpage>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thankachan</surname>
<given-names>SV</given-names>
</name>
<name>
<surname>Chockalingam</surname>
<given-names>SP</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Krishnan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A greedy alignment-free distance estimator for phylogenetic inference</article-title>
<source>BMC Bioinformatics</source>
<year>2017</year>
<volume>18</volume>
<issue>8</issue>
<fpage>238</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-017-1658-0</pub-id>
<pub-id pub-id-type="pmid">28617225</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Rong</surname>
<given-names>LH</given-names>
</name>
<name>
<surname>Yau</surname>
<given-names>ST</given-names>
</name>
</person-group>
<article-title>A novel alignment-free vector method to cluster protein sequences</article-title>
<source>J Theor Biol</source>
<year>2017</year>
<volume>427</volume>
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1016/j.jtbi.2017.06.002</pub-id>
<pub-id pub-id-type="pmid">28587743</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tripathi</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pandey</surname>
<given-names>PN</given-names>
</name>
</person-group>
<article-title>A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou’s pseudo amino acid composition</article-title>
<source>J Theor Biol</source>
<year>2017</year>
<volume>424</volume>
<fpage>49</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="doi">10.1016/j.jtbi.2017.04.027</pub-id>
<pub-id pub-id-type="pmid">28476562</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pajuste</surname>
<given-names>FD</given-names>
</name>
<name>
<surname>Kaplinski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Mols</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Puurand</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Lepamets</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Remm</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads</article-title>
<source>Sci Rep</source>
<year>2017</year>
<volume>7</volume>
<issue>1</issue>
<fpage>2537</fpage>
<pub-id pub-id-type="doi">10.1038/s41598-017-02487-5</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rudewicz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Soueidan</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Uricaru</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bonnefoi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Iggo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bergh</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Nikolski</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>MICADo - looking for mutations in targeted PacBio cancer data: An alignment-free method</article-title>
<source>Front Genet</source>
<year>2016</year>
<volume>7</volume>
<fpage>214</fpage>
<pub-id pub-id-type="doi">10.3389/fgene.2016.00214</pub-id>
<pub-id pub-id-type="pmid">28008336</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>YB</given-names>
</name>
<name>
<surname>Ragan</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF</article-title>
<source>Sci Rep</source>
<year>2016</year>
<volume>6</volume>
<fpage>30308</fpage>
<pub-id pub-id-type="doi">10.1038/srep30308</pub-id>
<pub-id pub-id-type="pmid">27453035</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bromberg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Grishin</surname>
<given-names>NV</given-names>
</name>
<name>
<surname>Otwinowski</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Phylogeny reconstruction with alignment-free method that corrects for horizontal gene transfer</article-title>
<source>PLoS Comput Biol</source>
<year>2016</year>
<volume>12</volume>
<issue>6</issue>
<fpage>1004985</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1004985</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brittnacher</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Heltshe</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Hayden</surname>
<given-names>HS</given-names>
</name>
<name>
<surname>Radey</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>EJ</given-names>
</name>
<name>
<surname>Damman</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Zisman</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Suskind</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>SI</given-names>
</name>
</person-group>
<article-title>Gutss: An alignment-free sequence comparison method for use in human intestinal microbiome and fecal microbiota transplantation analysis</article-title>
<source>PLoS ONE</source>
<year>2016</year>
<volume>11</volume>
<issue>7</issue>
<fpage>0158897</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0158897</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pham</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Phan</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>An accurate and fast alignment-free method for profiling microbial communities</article-title>
<source>J Bioinforma Comput Biol</source>
<year>2017</year>
<volume>15</volume>
<issue>3</issue>
<fpage>1740001</fpage>
<pub-id pub-id-type="doi">10.1142/S0219720017400017</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<mixed-citation publication-type="other">Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: Efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics. 2017; online first.</mixed-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yaveroglu</surname>
<given-names>ON</given-names>
</name>
<name>
<surname>Milenkovic</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Przulj</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Proper evaluation of alignment-free network comparison methods</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>16</issue>
<fpage>2697</fpage>
<lpage>704</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv170</pub-id>
<pub-id pub-id-type="pmid">25810431</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qian</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Jun</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Leuze</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ussery</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Nookaew</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer</article-title>
<source>Sci Rep</source>
<year>2017</year>
<volume>7</volume>
<fpage>40712</fpage>
<pub-id pub-id-type="doi">10.1038/srep40712</pub-id>
<pub-id pub-id-type="pmid">28102365</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>He</surname>
<given-names>L</given-names>
</name>
<name>
<surname>He</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Yau</surname>
<given-names>SS</given-names>
</name>
</person-group>
<article-title>Zika and flaviviruses phylogeny based on the alignment-free natural vector method</article-title>
<source>DNA Cell Biol</source>
<year>2017</year>
<volume>36</volume>
<issue>2</issue>
<fpage>109</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="doi">10.1089/dna.2016.3532</pub-id>
<pub-id pub-id-type="pmid">27977308</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Golia</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Moeller</surname>
<given-names>GK</given-names>
</name>
<name>
<surname>Jankevicius</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hegele</surname>
<given-names>A</given-names>
</name>
<name>
<surname>PreiBer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Mai</surname>
<given-names>LT</given-names>
</name>
<name>
<surname>Imhof</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Timinszky</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences</article-title>
<source>Nucleic Acids Res</source>
<year>2017</year>
<volume>45</volume>
<issue>1</issue>
<fpage>39</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkw904</pub-id>
<pub-id pub-id-type="pmid">27899557</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Madsen</surname>
<given-names>MH</given-names>
</name>
<name>
<surname>Boher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hansen</surname>
<given-names>PE</given-names>
</name>
<name>
<surname>Jørgensen</surname>
<given-names>JF</given-names>
</name>
</person-group>
<article-title>Alignment-free characterization of 2D gratings</article-title>
<source>Appl Opt</source>
<year>2016</year>
<volume>55</volume>
<issue>2</issue>
<fpage>317</fpage>
<pub-id pub-id-type="doi">10.1364/AO.55.000317</pub-id>
<pub-id pub-id-type="pmid">26835768</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<mixed-citation publication-type="other">Sandhya M, Prasad MVNK. k-nearest neighborhood structure (k-nns) based alignment-free method for fingerprint template protection. In: International Conference on Biometrics: 2015. p. 386–93.</mixed-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bonhamcarter</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Steele</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bastola</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis</article-title>
<source>Brief Bioinforma</source>
<year>2014</year>
<volume>15</volume>
<issue>6</issue>
<fpage>890</fpage>
<lpage>905</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt052</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Information theory applications for biological sequence analysis</article-title>
<source>Brief Bioinforma</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>376</fpage>
<lpage>89</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt068</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Badger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Kwong</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kearney</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>An information-based sequence distance and its application to whole mitochondrial genome phylogeny</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<issue>2</issue>
<fpage>149</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.2.149</pub-id>
<pub-id pub-id-type="pmid">11238070</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Integrating overlapping structures and background information of words significantly improves biological sequence comparison</article-title>
<source>PLoS ONE</source>
<year>2011</year>
<volume>6</volume>
<issue>11</issue>
<fpage>26779</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0026779</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Sayood</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>The average mutual information profile as a genomic signature</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>48</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-48</pub-id>
<pub-id pub-id-type="pmid">18218139</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blaisdell</surname>
<given-names>BE</given-names>
</name>
</person-group>
<article-title>A measure of the similarity of sets of sequences not requiring sequence alignment</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>1986</year>
<volume>83</volume>
<issue>14</issue>
<fpage>5155</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.83.14.5155</pub-id>
<pub-id pub-id-type="pmid">3460087</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dan</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A novel hierarchical clustering algorithm for gene sequences</article-title>
<source>BMC Bioinformatics</source>
<year>2012</year>
<volume>13</volume>
<issue>1</issue>
<fpage>174</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-13-174</pub-id>
<pub-id pub-id-type="pmid">22823405</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>BI</given-names>
</name>
</person-group>
<article-title>Whole proteome prokaryote phylogeny without sequence alignment: A k-string composition approach</article-title>
<source>J Mol Evol</source>
<year>2004</year>
<volume>58</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>11</lpage>
<pub-id pub-id-type="doi">10.1007/s00239-003-2493-7</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pham</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Zuegg</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>A probabilistic measure for alignment-free sequence comparison</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>18</issue>
<fpage>3455</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth426</pub-id>
<pub-id pub-id-type="pmid">15271780</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Burke</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Davison</surname>
<given-names>DB</given-names>
</name>
</person-group>
<article-title>A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words</article-title>
<source>Biometrics</source>
<year>1997</year>
<volume>53</volume>
<issue>4</issue>
<fpage>1431</fpage>
<pub-id pub-id-type="doi">10.2307/2533509</pub-id>
<pub-id pub-id-type="pmid">9423258</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>YC</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>LA</given-names>
</name>
</person-group>
<article-title>Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition</article-title>
<source>Biometrics</source>
<year>2001</year>
<volume>57</volume>
<issue>2</issue>
<fpage>441</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1111/j.0006-341X.2001.00441.x</pub-id>
<pub-id pub-id-type="pmid">11414568</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<mixed-citation publication-type="other">Shi L, Huang H. DNA Sequences Analysis Based on Classifications of Nucleotide Bases. In: Affective Computing and Intelligent Interaction. 1st. Springer: 2012. p. 379–84.</mixed-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bai</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>A 2-D graphical representation of protein sequences based on nucleotide triplet codons</article-title>
<source>Chem Phys Lett</source>
<year>2005</year>
<volume>413</volume>
<issue>4</issue>
<fpage>458</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1016/j.cplett.2005.08.011</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leimeister</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Boden</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Horwege</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lindner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Morgenstern</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Fast alignment-free sequence comparison using spaced-word frequencies</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>14</issue>
<fpage>1991</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu177</pub-id>
<pub-id pub-id-type="pmid">24700317</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schimd</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values</article-title>
<source>BMC Med Genet</source>
<year>2016</year>
<volume>9</volume>
<issue>1</issue>
<fpage>42</fpage>
<lpage>97</lpage>
</element-citation>
</ref>
<ref id="CR40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schwende</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Pham</surname>
<given-names>TD</given-names>
</name>
</person-group>
<article-title>Pattern recognition and probabilistic measures in alignment-free sequence analysis</article-title>
<source>Brief Bioinforma</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>354</fpage>
<lpage>68</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt070</pub-id>
</element-citation>
</ref>
<ref id="CR41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>RY</given-names>
</name>
</person-group>
<article-title>A wavelet-based feature vector model for DNA clustering</article-title>
<source>Genet Mol Res</source>
<year>2015</year>
<volume>14</volume>
<issue>4</issue>
<fpage>19163</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.4238/2015.December.29.26</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pyatkov</surname>
<given-names>MI</given-names>
</name>
<name>
<surname>Pankratov</surname>
<given-names>AN</given-names>
</name>
</person-group>
<article-title>SBARS: fast creation of dotplots for DNA sequences on different scales using GA-, GC-content</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>12</issue>
<fpage>1765</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu095</pub-id>
<pub-id pub-id-type="pmid">24532721</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheever</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Overton</surname>
<given-names>GC</given-names>
</name>
<name>
<surname>Searls</surname>
<given-names>DB</given-names>
</name>
</person-group>
<article-title>Fast Fourier transform-based correlation of DNA sequences using complex plane encoding</article-title>
<source>Bioinformatics</source>
<year>1991</year>
<volume>7</volume>
<issue>2</issue>
<fpage>143</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/7.2.143</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pal</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ghosh</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Maji</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Bhattacharya</surname>
<given-names>DK</given-names>
</name>
</person-group>
<article-title>Use of FFT in protein sequence comparison under their binary representations</article-title>
<source>Comput Mole Biosci</source>
<year>2016</year>
<volume>6</volume>
<issue>2</issue>
<fpage>33</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="doi">10.4236/cmb.2016.62003</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grabherr</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Russell</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mauceli</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Alföldi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Di Palma</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Lindblad-Toh</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Genome-wide synteny through highly sensitive sequence alignment: Satsuma</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<issue>9</issue>
<fpage>1145</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq102</pub-id>
<pub-id pub-id-type="pmid">20208069</pub-id>
</element-citation>
</ref>
<ref id="CR46">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chaovalit</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gangopadhyay</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Karabatis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Discrete wavelet transform-based time series analysis and mining</article-title>
<source>ACM Comput Surv</source>
<year>2011</year>
<volume>43</volume>
<issue>2</issue>
<fpage>1</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1145/1883612.1883613</pub-id>
</element-citation>
</ref>
<ref id="CR47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tsonis</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Wavelet analysis of DNA sequences</article-title>
<source>Phys Rev E</source>
<year>1996</year>
<volume>53</volume>
<issue>2</issue>
<fpage>1828</fpage>
<pub-id pub-id-type="doi">10.1103/PhysRevE.53.1828</pub-id>
</element-citation>
</ref>
<ref id="CR48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haimovich</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Byrne</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ramaswamy</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Welsh</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>Wavelet analysis of DNA walks</article-title>
<source>J Comput Biol</source>
<year>2006</year>
<volume>13</volume>
<issue>7</issue>
<fpage>1289</fpage>
<lpage>98</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2006.13.1289</pub-id>
<pub-id pub-id-type="pmid">17037959</pub-id>
</element-citation>
</ref>
<ref id="CR49">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nanni</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Brahnam</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lumini</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Combining multiple approaches for gene microarray classification</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>8</issue>
<fpage>1151</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts108</pub-id>
<pub-id pub-id-type="pmid">22390939</pub-id>
</element-citation>
</ref>
<ref id="CR50">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abbasi</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Rostami</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Karimian</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform</article-title>
<source>BMC Bioinformatics</source>
<year>2011</year>
<volume>12</volume>
<issue>1</issue>
<fpage>430</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-430</pub-id>
<pub-id pub-id-type="pmid">22050630</pub-id>
</element-citation>
</ref>
<ref id="CR51">
<label>51</label>
<mixed-citation publication-type="other">Padole M. C. Dimensionality reduction of dna sequences using wavelet transforms. In: World Congress : Applied Computing Conference: 2013. p. p145–52.</mixed-citation>
</ref>
<ref id="CR52">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Athanasiadis</surname>
<given-names>EI</given-names>
</name>
<name>
<surname>Cavouras</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Glotsos</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Georgiadis</surname>
<given-names>PV</given-names>
</name>
<name>
<surname>Kalatzis</surname>
<given-names>IK</given-names>
</name>
<name>
<surname>Nikiforidis</surname>
<given-names>GC</given-names>
</name>
</person-group>
<article-title>Segmentation of complementary DNA microarray images by wavelet-based Markov random field model</article-title>
<source>IEEE Trans Inform Technol Biomed</source>
<year>2009</year>
<volume>13</volume>
<issue>6</issue>
<fpage>1068</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="doi">10.1109/TITB.2009.2032332</pub-id>
</element-citation>
</ref>
<ref id="CR53">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Defect detection in magnetic tile images based on stationary wavelet transform</article-title>
<source>NDT E Int</source>
<year>2016</year>
<volume>83</volume>
<fpage>78</fpage>
<lpage>87</lpage>
<pub-id pub-id-type="doi">10.1016/j.ndteint.2016.04.006</pub-id>
</element-citation>
</ref>
<ref id="CR54">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Léonard</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Mouchard</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Salson</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>On the number of elements to reorder when updating a suffix array</article-title>
<source>J Discret Algoritm</source>
<year>2012</year>
<volume>11</volume>
<fpage>87</fpage>
<lpage>99</lpage>
<pub-id pub-id-type="doi">10.1016/j.jda.2011.01.002</pub-id>
</element-citation>
</ref>
<ref id="CR55">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fowler</surname>
<given-names>JE</given-names>
</name>
</person-group>
<article-title>The redundant discrete wavelet transform and additive noise</article-title>
<source>IEEE Signal Process Lett</source>
<year>2005</year>
<volume>12</volume>
<issue>9</issue>
<fpage>629</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1109/LSP.2005.853048</pub-id>
</element-citation>
</ref>
<ref id="CR56">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Chockalingam</surname>
<given-names>SP</given-names>
</name>
<name>
<surname>Aluru</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A survey of error-correction methods for next-generation sequencing</article-title>
<source>Brief Bioinforma</source>
<year>2013</year>
<volume>14</volume>
<issue>1</issue>
<fpage>56</fpage>
<pub-id pub-id-type="doi">10.1093/bib/bbs015</pub-id>
</element-citation>
</ref>
<ref id="CR57">
<label>57</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Needleman</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Wunsch</surname>
<given-names>CD</given-names>
</name>
</person-group>
<article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>
<source>J Mole Biol</source>
<year>1970</year>
<volume>48</volume>
<issue>3</issue>
<fpage>443</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(70)90057-4</pub-id>
</element-citation>
</ref>
<ref id="CR58">
<label>58</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wagner</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>The string-to-string correction problem</article-title>
<source>J ACM</source>
<year>1974</year>
<volume>21</volume>
<issue>1</issue>
<fpage>168</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1145/321796.321811</pub-id>
</element-citation>
</ref>
<ref id="CR59">
<label>59</label>
<mixed-citation publication-type="other">Macqueen J. Some methods for classification and analysis of multivariate observations. In: Proc. of Berkeley Symposium on Mathematical Statistics and Probability: 1967. p. 281–97.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0002728 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0002728 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021