Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

LAF: Logic Alignment Free and its application to bacterial genomes classification

Identifieur interne : 000339 ( Pmc/Corpus ); précédent : 000338; suivant : 000340

LAF: Logic Alignment Free and its application to bacterial genomes classification

Auteurs : Emanuel Weitschek ; Fabio Cunial ; Giovanni Felici

Source :

RBID : PMC:4673791

Abstract

Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences.

In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules).

We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just few k-mers.

State of the art methods to adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.


Url:
DOI: 10.1186/s13040-015-0073-1
PubMed: 26664519
PubMed Central: 4673791

Links to Exploration step

PMC:4673791

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">LAF: Logic Alignment Free and its application to bacterial genomes classification</title>
<author>
<name sortKey="Weitschek, Emanuel" sort="Weitschek, Emanuel" uniqKey="Weitschek E" first="Emanuel" last="Weitschek">Emanuel Weitschek</name>
<affiliation>
<nlm:aff id="Aff1">Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II, 39, Rome, 00186 Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, Rome, 00185 Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cunial, Fabio" sort="Cunial, Fabio" uniqKey="Cunial F" first="Fabio" last="Cunial">Fabio Cunial</name>
<affiliation>
<nlm:aff id="Aff2">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, FI-00014 Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Felici, Giovanni" sort="Felici, Giovanni" uniqKey="Felici G" first="Giovanni" last="Felici">Giovanni Felici</name>
<affiliation>
<nlm:aff id="Aff3">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, Rome, 00185 Italy</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26664519</idno>
<idno type="pmc">4673791</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673791</idno>
<idno type="RBID">PMC:4673791</idno>
<idno type="doi">10.1186/s13040-015-0073-1</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000339</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000339</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">LAF: Logic Alignment Free and its application to bacterial genomes classification</title>
<author>
<name sortKey="Weitschek, Emanuel" sort="Weitschek, Emanuel" uniqKey="Weitschek E" first="Emanuel" last="Weitschek">Emanuel Weitschek</name>
<affiliation>
<nlm:aff id="Aff1">Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II, 39, Rome, 00186 Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff3">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, Rome, 00185 Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Cunial, Fabio" sort="Cunial, Fabio" uniqKey="Cunial F" first="Fabio" last="Cunial">Fabio Cunial</name>
<affiliation>
<nlm:aff id="Aff2">Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, FI-00014 Finland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Felici, Giovanni" sort="Felici, Giovanni" uniqKey="Felici G" first="Giovanni" last="Felici">Giovanni Felici</name>
<affiliation>
<nlm:aff id="Aff3">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, Rome, 00185 Italy</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BioData Mining</title>
<idno type="eISSN">1756-0381</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (
<italic>k</italic>
-mers) that occur in the analyzed sequences.</p>
<p>In this paper, we present Logic Alignment Free (
<sc>LAF</sc>
), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of
<italic>k</italic>
-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (
<italic>if-then rules</italic>
).</p>
<p>We apply
<sc>LAF</sc>
successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just few
<italic>k</italic>
-mers.</p>
<p>State of the art methods to adjust the frequency of
<italic>k</italic>
-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that
<sc>LAF</sc>
is effective in identifying it.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Needleman, Sb" uniqKey="Needleman S">SB Needleman</name>
</author>
<author>
<name sortKey="Wunsch, Cd" uniqKey="Wunsch C">CD Wunsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pearson, Wr" uniqKey="Pearson W">WR Pearson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Schaffer, Aa" uniqKey="Schaffer A">AA Schaffer</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thompson, Jd" uniqKey="Thompson J">JD Thompson</name>
</author>
<author>
<name sortKey="Gibson, T" uniqKey="Gibson T">T Gibson</name>
</author>
<author>
<name sortKey="Higgins, Dg" uniqKey="Higgins D">DG Higgins</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mokaddem, A" uniqKey="Mokaddem A">A Mokaddem</name>
</author>
<author>
<name sortKey="Elloumi, M" uniqKey="Elloumi M">M Elloumi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Katoh, K" uniqKey="Katoh K">K Katoh</name>
</author>
<author>
<name sortKey="Misawa, K" uniqKey="Misawa K">K Misawa</name>
</author>
<author>
<name sortKey="Kuma, K I" uniqKey="Kuma K">K-i Kuma</name>
</author>
<author>
<name sortKey="Miyata, T" uniqKey="Miyata T">T Miyata</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Kasif, S" uniqKey="Kasif S">S Kasif</name>
</author>
<author>
<name sortKey="Fleischmann, Rd" uniqKey="Fleischmann R">RD Fleischmann</name>
</author>
<author>
<name sortKey="Peterson, J" uniqKey="Peterson J">J Peterson</name>
</author>
<author>
<name sortKey="White, O" uniqKey="White O">O White</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Vitnyi, Pmb" uniqKey="Vitnyi P">PMB Vitnyi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Almeida, Js" uniqKey="Almeida J">JS Almeida</name>
</author>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bentley, Sd" uniqKey="Bentley S">SD Bentley</name>
</author>
<author>
<name sortKey="Parkhill, J" uniqKey="Parkhill J">J Parkhill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Josse, J" uniqKey="Josse J">J Josse</name>
</author>
<author>
<name sortKey="Kaiser, A" uniqKey="Kaiser A">A Kaiser</name>
</author>
<author>
<name sortKey="Kornberg, A" uniqKey="Kornberg A">A Kornberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Trautner, T" uniqKey="Trautner T">T Trautner</name>
</author>
<author>
<name sortKey="Swartz, M" uniqKey="Swartz M">M Swartz</name>
</author>
<author>
<name sortKey="Kornberg, A" uniqKey="Kornberg A">A Kornberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russell, G" uniqKey="Russell G">G Russell</name>
</author>
<author>
<name sortKey="Walker, P" uniqKey="Walker P">P Walker</name>
</author>
<author>
<name sortKey="Elton, R" uniqKey="Elton R">R Elton</name>
</author>
<author>
<name sortKey="Subak Sharpe, J" uniqKey="Subak Sharpe J">J Subak-Sharpe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Russell, G" uniqKey="Russell G">G Russell</name>
</author>
<author>
<name sortKey="Subak Sharpe, J" uniqKey="Subak Sharpe J">J Subak-Sharpe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
<author>
<name sortKey="Burge, C" uniqKey="Burge C">C Burge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
<author>
<name sortKey="Mrazek, J" uniqKey="Mrazek J">J Mrázek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Meyerdierks, A" uniqKey="Meyerdierks A">A Meyerdierks</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Amann, R" uniqKey="Amann R">R Amann</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glöckner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, F" uniqKey="Zhou F">F Zhou</name>
</author>
<author>
<name sortKey="Olman, V" uniqKey="Olman V">V Olman</name>
</author>
<author>
<name sortKey="Xu, Y" uniqKey="Xu Y">Y Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deschavanne, Pj" uniqKey="Deschavanne P">PJ Deschavanne</name>
</author>
<author>
<name sortKey="Giron, A" uniqKey="Giron A">A Giron</name>
</author>
<author>
<name sortKey="Vilain, J" uniqKey="Vilain J">J Vilain</name>
</author>
<author>
<name sortKey="Fagot, G" uniqKey="Fagot G">G Fagot</name>
</author>
<author>
<name sortKey="Fertil, B" uniqKey="Fertil B">B Fertil</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sandberg, R" uniqKey="Sandberg R">R Sandberg</name>
</author>
<author>
<name sortKey="Winberg, G" uniqKey="Winberg G">G Winberg</name>
</author>
<author>
<name sortKey="Br Nden, Ci" uniqKey="Br Nden C">CI Bränden</name>
</author>
<author>
<name sortKey="Kaske, A" uniqKey="Kaske A">A Kaske</name>
</author>
<author>
<name sortKey="Ernberg, I" uniqKey="Ernberg I">I Ernberg</name>
</author>
<author>
<name sortKey="Coster, J" uniqKey="Coster J">J Cöster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author>
<name sortKey="Meinersmann, Rj" uniqKey="Meinersmann R">RJ Meinersmann</name>
</author>
<author>
<name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
<author>
<name sortKey="Blaser, Mj" uniqKey="Blaser M">MJ Blaser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gatherer, D" uniqKey="Gatherer D">D Gatherer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Takahashi, M" uniqKey="Takahashi M">M Takahashi</name>
</author>
<author>
<name sortKey="Kryukov, K" uniqKey="Kryukov K">K Kryukov</name>
</author>
<author>
<name sortKey="Saitou, N" uniqKey="Saitou N">N Saitou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Waldmann, J" uniqKey="Waldmann J">J Waldmann</name>
</author>
<author>
<name sortKey="Lombardot, T" uniqKey="Lombardot T">T Lombardot</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glockner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I Rigoutsos</name>
</author>
<author>
<name sortKey="Floratos, A" uniqKey="Floratos A">A Floratos</name>
</author>
<author>
<name sortKey="Ouzounis, C" uniqKey="Ouzounis C">C Ouzounis</name>
</author>
<author>
<name sortKey="Gao, Y" uniqKey="Gao Y">Y Gao</name>
</author>
<author>
<name sortKey="Parida, L" uniqKey="Parida L">L Parida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chor, B" uniqKey="Chor B">B Chor</name>
</author>
<author>
<name sortKey="Horn, D" uniqKey="Horn D">D Horn</name>
</author>
<author>
<name sortKey="Goldman, N" uniqKey="Goldman N">N Goldman</name>
</author>
<author>
<name sortKey="Levy, Y" uniqKey="Levy Y">Y Levy</name>
</author>
<author>
<name sortKey="Massingham, T" uniqKey="Massingham T">T Massingham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="O Ul, H" uniqKey="O Ul H">H Oğul</name>
</author>
<author>
<name sortKey="Mumcuo Lu, Eu" uniqKey="Mumcuo Lu E">EÜ Mumcuoğlu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S Karlin</name>
</author>
<author>
<name sortKey="Mrazek, J" uniqKey="Mrazek J">J Mrazek</name>
</author>
<author>
<name sortKey="Campbell, Am" uniqKey="Campbell A">AM Campbell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foerstner, Ku" uniqKey="Foerstner K">KU Foerstner</name>
</author>
<author>
<name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author>
<name sortKey="Hooper, Sd" uniqKey="Hooper S">SD Hooper</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mchardy, Ac" uniqKey="Mchardy A">AC McHardy</name>
</author>
<author>
<name sortKey="Martin, Hg" uniqKey="Martin H">HG Martín</name>
</author>
<author>
<name sortKey="Tsirigos, A" uniqKey="Tsirigos A">A Tsirigos</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I Rigoutsos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chatterji, S" uniqKey="Chatterji S">S Chatterji</name>
</author>
<author>
<name sortKey="Yamazaki, I" uniqKey="Yamazaki I">I Yamazaki</name>
</author>
<author>
<name sortKey="Bai, Z" uniqKey="Bai Z">Z Bai</name>
</author>
<author>
<name sortKey="Eisen, Ja" uniqKey="Eisen J">JA Eisen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leung, Hc" uniqKey="Leung H">HC Leung</name>
</author>
<author>
<name sortKey="Yiu, S" uniqKey="Yiu S">S Yiu</name>
</author>
<author>
<name sortKey="Yang, B" uniqKey="Yang B">B Yang</name>
</author>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Liu, Z" uniqKey="Liu Z">Z Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Leung, Hc" uniqKey="Leung H">HC Leung</name>
</author>
<author>
<name sortKey="Yiu, S" uniqKey="Yiu S">S Yiu</name>
</author>
<author>
<name sortKey="Chin, Fy" uniqKey="Chin F">FY Chin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tanaseichuk, O" uniqKey="Tanaseichuk O">O Tanaseichuk</name>
</author>
<author>
<name sortKey="Borneman, J" uniqKey="Borneman J">J Borneman</name>
</author>
<author>
<name sortKey="Jiang, T" uniqKey="Jiang T">T Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Zhai, Z" uniqKey="Zhai Z">Z Zhai</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stuart, Gw" uniqKey="Stuart G">GW Stuart</name>
</author>
<author>
<name sortKey="Moffett, K" uniqKey="Moffett K">K Moffett</name>
</author>
<author>
<name sortKey="Baker, S" uniqKey="Baker S">S Baker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stuart, Gw" uniqKey="Stuart G">GW Stuart</name>
</author>
<author>
<name sortKey="Moffett, K" uniqKey="Moffett K">K Moffett</name>
</author>
<author>
<name sortKey="Leader, Jj" uniqKey="Leader J">JJ Leader</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Comin, M" uniqKey="Comin M">M Comin</name>
</author>
<author>
<name sortKey="Verzotto, D" uniqKey="Verzotto D">D Verzotto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solovyev, Vv" uniqKey="Solovyev V">VV Solovyev</name>
</author>
<author>
<name sortKey="Makarova, Ks" uniqKey="Makarova K">KS Makarova</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ratnasingham, S" uniqKey="Ratnasingham S">S Ratnasingham</name>
</author>
<author>
<name sortKey="Hebert, Pdn" uniqKey="Hebert P">PDN Hebert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Gibbons, T" uniqKey="Gibbons T">T Gibbons</name>
</author>
<author>
<name sortKey="Ghodsi, M" uniqKey="Ghodsi M">M Ghodsi</name>
</author>
<author>
<name sortKey="Treangen, T" uniqKey="Treangen T">T Treangen</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Segata, N" uniqKey="Segata N">N Segata</name>
</author>
<author>
<name sortKey="Waldron, L" uniqKey="Waldron L">L Waldron</name>
</author>
<author>
<name sortKey="Ballarini, A" uniqKey="Ballarini A">A Ballarini</name>
</author>
<author>
<name sortKey="Narasimhan, V" uniqKey="Narasimhan V">V Narasimhan</name>
</author>
<author>
<name sortKey="Jousson, O" uniqKey="Jousson O">O Jousson</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edwards, Ra" uniqKey="Edwards R">RA Edwards</name>
</author>
<author>
<name sortKey="Olson, R" uniqKey="Olson R">R Olson</name>
</author>
<author>
<name sortKey="Disz, T" uniqKey="Disz T">T Disz</name>
</author>
<author>
<name sortKey="Pusch, Gd" uniqKey="Pusch G">GD Pusch</name>
</author>
<author>
<name sortKey="Vonstein, V" uniqKey="Vonstein V">V Vonstein</name>
</author>
<author>
<name sortKey="Stevens, R" uniqKey="Stevens R">R Stevens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Seth, S" uniqKey="Seth S">S Seth</name>
</author>
<author>
<name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N Välimäki</name>
</author>
<author>
<name sortKey="Kaski, S" uniqKey="Kaski S">S Kaski</name>
</author>
<author>
<name sortKey="Honkela, A" uniqKey="Honkela A">A Honkela</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
<author>
<name sortKey="Fiscon, G" uniqKey="Fiscon G">G Fiscon</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lehr, T" uniqKey="Lehr T">T Lehr</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
<author>
<name sortKey="Zeumer, D" uniqKey="Zeumer D">D Zeumer</name>
</author>
<author>
<name sortKey="Jayadev, S" uniqKey="Jayadev S">S Jayadev</name>
</author>
<author>
<name sortKey="Ritchie, M" uniqKey="Ritchie M">M Ritchie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Polychronopoulos, D" uniqKey="Polychronopoulos D">D Polychronopoulos</name>
</author>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
<author>
<name sortKey="Dimitrieva, S" uniqKey="Dimitrieva S">S Dimitrieva</name>
</author>
<author>
<name sortKey="Bucher, P" uniqKey="Bucher P">P Bucher</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Almirantis, Y" uniqKey="Almirantis Y">Y Almirantis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kudenko, D" uniqKey="Kudenko D">D Kudenko</name>
</author>
<author>
<name sortKey="Hirsh, H" uniqKey="Hirsh H">H Hirsh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ben Hur, A" uniqKey="Ben Hur A">A Ben-Hur</name>
</author>
<author>
<name sortKey="Brutlag, D" uniqKey="Brutlag D">D Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xing, Z" uniqKey="Xing Z">Z Xing</name>
</author>
<author>
<name sortKey="Pei, J" uniqKey="Pei J">J Pei</name>
</author>
<author>
<name sortKey="Keogh, E" uniqKey="Keogh E">E Keogh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vapnik, Vn" uniqKey="Vapnik V">VN Vapnik</name>
</author>
<author>
<name sortKey="Vapnik, V" uniqKey="Vapnik V">V Vapnik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bertolazzi, P" uniqKey="Bertolazzi P">P Bertolazzi</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
<author>
<name sortKey="Lo Presti, A" uniqKey="Lo Presti A">A Lo Presti</name>
</author>
<author>
<name sortKey="Drovandi, G" uniqKey="Drovandi G">G Drovandi</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Ciccozzi, M" uniqKey="Ciccozzi M">M Ciccozzi</name>
</author>
<author>
<name sortKey="Ciotti, M" uniqKey="Ciotti M">M Ciotti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gaines, Br" uniqKey="Gaines B">BR Gaines</name>
</author>
<author>
<name sortKey="Compton, P" uniqKey="Compton P">P Compton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
<author>
<name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cohen, Ww" uniqKey="Cohen W">WW Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Truemper, K" uniqKey="Truemper K">K Truemper</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bertolazzi, P" uniqKey="Bertolazzi P">P Bertolazzi</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quinlan, Jr" uniqKey="Quinlan J">JR Quinlan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hall, M" uniqKey="Hall M">M Hall</name>
</author>
<author>
<name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
<author>
<name sortKey="Holmes, G" uniqKey="Holmes G">G Holmes</name>
</author>
<author>
<name sortKey="Pfahringer, B" uniqKey="Pfahringer B">B Pfahringer</name>
</author>
<author>
<name sortKey="Reutemann, P" uniqKey="Reutemann P">P Reutemann</name>
</author>
<author>
<name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dasarathy, Bv" uniqKey="Dasarathy B">BV Dasarathy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Meyerdiekers, A" uniqKey="Meyerdiekers A">A Meyerdiekers</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glockner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author>
<name sortKey="Meinersmann, Rj" uniqKey="Meinersmann R">RJ Meinersmann</name>
</author>
<author>
<name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
<author>
<name sortKey="Blaser, Mj" uniqKey="Blaser M">MJ Blaser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Waldmann, J" uniqKey="Waldmann J">J Waldmann</name>
</author>
<author>
<name sortKey="Lombardot, T" uniqKey="Lombardot T">T Lombardot</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glockner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, Rh" uniqKey="Chan R">RH Chan</name>
</author>
<author>
<name sortKey="Chan, Th" uniqKey="Chan T">TH Chan</name>
</author>
<author>
<name sortKey="Yeung, Hm" uniqKey="Yeung H">HM Yeung</name>
</author>
<author>
<name sortKey="Wang, Rw" uniqKey="Wang R">RW Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Hao, Bi" uniqKey="Hao B">BI Hao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Zg" uniqKey="Yu Z">ZG Yu</name>
</author>
<author>
<name sortKey="Zhou, Lq" uniqKey="Zhou L">LQ Zhou</name>
</author>
<author>
<name sortKey="Anh, Vv" uniqKey="Anh V">VV Anh</name>
</author>
<author>
<name sortKey="Chu, Kh" uniqKey="Chu K">KH Chu</name>
</author>
<author>
<name sortKey="Long, Sc" uniqKey="Long S">SC Long</name>
</author>
<author>
<name sortKey="Deng, Jq" uniqKey="Deng J">JQ Deng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Reinert, G" uniqKey="Reinert G">G Reinert</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, K" uniqKey="Huang K">K Huang</name>
</author>
<author>
<name sortKey="Brady, A" uniqKey="Brady A">A Brady</name>
</author>
<author>
<name sortKey="Mahurkar, A" uniqKey="Mahurkar A">A Mahurkar</name>
</author>
<author>
<name sortKey="White, O" uniqKey="White O">O White</name>
</author>
<author>
<name sortKey="Gevers, D" uniqKey="Gevers D">D Gevers</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BioData Min</journal-id>
<journal-id journal-id-type="iso-abbrev">BioData Min</journal-id>
<journal-title-group>
<journal-title>BioData Mining</journal-title>
</journal-title-group>
<issn pub-type="epub">1756-0381</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26664519</article-id>
<article-id pub-id-type="pmc">4673791</article-id>
<article-id pub-id-type="publisher-id">73</article-id>
<article-id pub-id-type="doi">10.1186/s13040-015-0073-1</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>LAF: Logic Alignment Free and its application to bacterial genomes classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Weitschek</surname>
<given-names>Emanuel</given-names>
</name>
<address>
<email>emanuel@iasi.cnr.it</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cunial</surname>
<given-names>Fabio</given-names>
</name>
<address>
<email>cunial@cs.helsinki.fi</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Felici</surname>
<given-names>Giovanni</given-names>
</name>
<address>
<email>giovanni.felici@iasi.cnr.it</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II, 39, Rome, 00186 Italy</aff>
<aff id="Aff2">
<label>2</label>
Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, FI-00014 Finland</aff>
<aff id="Aff3">
<label>3</label>
Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, Rome, 00185 Italy</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>8</day>
<month>12</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>8</day>
<month>12</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>8</volume>
<elocation-id>39</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>3</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>11</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© Weitschek et al. 2015</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p>Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (
<italic>k</italic>
-mers) that occur in the analyzed sequences.</p>
<p>In this paper, we present Logic Alignment Free (
<sc>LAF</sc>
), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of
<italic>k</italic>
-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (
<italic>if-then rules</italic>
).</p>
<p>We apply
<sc>LAF</sc>
successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just few
<italic>k</italic>
-mers.</p>
<p>State of the art methods to adjust the frequency of
<italic>k</italic>
-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that
<sc>LAF</sc>
is effective in identifying it.</p>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Supervised classification</kwd>
<kwd>Alignment-free sequence comparison</kwd>
<kwd>Bacterial taxonomy</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The field of biological sequence analysis relies on mathematical, statistical, and computer science methods for discovering similarities among different organisms, understanding their features and their structure, detecting ancestry, relatedness, evolution, and common functions.</p>
<p>Several well-established sequence comparison algorithms are based on sequence alignment: they compute sequence similarity by aligning portions of sequences (e.g., subsequences) that have common nucleotide assignments. The alignments of two or more sequences are scored according to the number of common nucleotides. Such methods can be exact or heuristic. Among exact methods, Smith-Waterman [
<xref ref-type="bibr" rid="CR1">1</xref>
] and Needleman-Wunsch [
<xref ref-type="bibr" rid="CR2">2</xref>
] use dynamic programming techniques. The first performs local sequence alignment: it detects the common regions between two sequences by comparing segments of all possible lengths. The second is a global alignment algorithm, designed to align entire sequences. In order to reduce the computational burden of exact methods, several heuristic algorithms have been designed, the most renowned being FASTA [
<xref ref-type="bibr" rid="CR3">3</xref>
] and BLAST [
<xref ref-type="bibr" rid="CR4">4</xref>
]. For the comparisons of more than two sequences, there are ad-hoc algorithms like Muscle [
<xref ref-type="bibr" rid="CR5">5</xref>
], ClustalW [
<xref ref-type="bibr" rid="CR6">6</xref>
], Motalign [
<xref ref-type="bibr" rid="CR7">7</xref>
], and Mafft [
<xref ref-type="bibr" rid="CR8">8</xref>
]. Alignment-based sequence analysis algorithms have a very high computational cost, especially when applied to a large set of sequences [
<xref ref-type="bibr" rid="CR9">9</xref>
]. Other problems may also be encountered when performing alignment on genome sequences, related with the presence of non-coding subsequences, or simply with the computational burden associated with the alignment of whole genomes [
<xref ref-type="bibr" rid="CR10">10</xref>
].</p>
<p>In order to address these issues, alignment-free sequence analysis methods can be considered. Such algorithms are mainly classified in two groups: methods based on sequence compression and methods that rely on the frequencies of the subsequences (oligomers) [
<xref ref-type="bibr" rid="CR9">9</xref>
].</p>
<p>The first class of methods compute a model that succinctly describes the sequence, and assess the similarity of the sequences by analyzing their compressed representations, e.g., Kolomogorov complexity [
<xref ref-type="bibr" rid="CR11">11</xref>
] or Universal Sequence Maps [
<xref ref-type="bibr" rid="CR12">12</xref>
].</p>
<p>In this work we focus on the second class of methods, alignment-free algorithms that rely on oligomer frequencies and map two strings
<italic>X</italic>
and
<italic>Y</italic>
onto corresponding multidimensional vectors
<bold>X</bold>
and
<bold>Y</bold>
; these vectors are indexed by a number of substrings in the given alphabet (a typical case is when all possible substrings of a predefined length
<italic>k</italic>
are used).
<bold>X</bold>
[
<italic>W</italic>
] and
<bold>Y</bold>
[
<italic>W</italic>
] – the element of
<bold>X</bold>
and
<bold>Y</bold>
associated with substring
<italic>W</italic>
– contain the number of occurrences of
<italic>W</italic>
in
<italic>X</italic>
and
<italic>Y</italic>
respectively. Often the number of occurrences is normalized and converted into a measure of statistical surprise using the length and distribution of characters in each string. Standard distance functions on vectors are then applied to
<bold>X</bold>
and
<bold>Y</bold>
, allowing the original strings to be compared by classical distance-based algorithms.</p>
<p>Alignment-free algorithms are currently the most scalable class of methods for reconstructing phylogenetic trees from thousands of large, distantly-related genomes and proteomes [
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR14">14</xref>
].</p>
<p>The success of alignment-free methods rests on extensive information on the substring composition of genomes and on codon-usage biases, cumulated over approximately fifty years, with particular emphasis on prokaryotes: from the first studies of
<sc>GC</sc>
content [
<xref ref-type="bibr" rid="CR15">15</xref>
], to the first detection of biases in the composition of pairs and quadruples of adjacent nucleotides [
<xref ref-type="bibr" rid="CR15">15</xref>
<xref ref-type="bibr" rid="CR21">21</xref>
], to the discovery of species-specific frequencies of 4-mers and 8-mers preserved in
<sc>DNA</sc>
fragments ranging from 40 kilobases to 400 bases [
<xref ref-type="bibr" rid="CR22">22</xref>
<xref ref-type="bibr" rid="CR26">26</xref>
], to more recent, unsupervised classifications [
<xref ref-type="bibr" rid="CR27">27</xref>
<xref ref-type="bibr" rid="CR29">29</xref>
] and more complex protein motifs [
<xref ref-type="bibr" rid="CR30">30</xref>
].</p>
<p>Since the very beginning, most such studies have relied on some form of noise filtration, either assuming an independent and identically distributed source or a Markov source of low order (i.e., normalizing the raw frequencies using their expectation and or variance according to the specified sources). Markov chains inferred from genomes have indeed been shown to reproduce large fractions of the frequency distribution of
<italic>k</italic>
-mers in the original genomes [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR31">31</xref>
,
<xref ref-type="bibr" rid="CR32">32</xref>
].</p>
<p>So far, classification has always relied on the frequency of
<italic>all</italic>
<italic>k</italic>
-mers [
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR33">33</xref>
], and minimality in phylogenetic signal has been investigated with respect to the length of the strings from which
<italic>k</italic>
-mers are extracted, rather than to the space of features used for classification. This trend continues in modern applications of
<italic>k</italic>
-mer composition to annotating and binning metagenomic reads [
<xref ref-type="bibr" rid="CR34">34</xref>
]: increasingly more sophisticated heuristics have allowed to reliably classify reads ranging from one kilobase to 75 bases, under a variety of species abundance scenarios [
<xref ref-type="bibr" rid="CR35">35</xref>
<xref ref-type="bibr" rid="CR40">40</xref>
]. However, fundamental questions on the distribution and concentration of phylogenetic signal in the space of all
<italic>k</italic>
-mers are still open and scarcely investigated. Among the few attempts in this direction, we mention the use of singular value decomposition (
<sc>SVD</sc>
[
<xref ref-type="bibr" rid="CR41">41</xref>
,
<xref ref-type="bibr" rid="CR42">42</xref>
]) and of irredundant shared substrings [
<xref ref-type="bibr" rid="CR43">43</xref>
] in phylogeny reconstruction, the use of few selected
<italic>k</italic>
-mers in barcoding genes [
<xref ref-type="bibr" rid="CR44">44</xref>
], and early attempts at classifying protein families using the frequency of a small set of dipeptides [
<xref ref-type="bibr" rid="CR45">45</xref>
].</p>
<p>In this work, we search for
<italic>a minimal set of k-mers whose frequency is sufficient to classify entire genomes</italic>
. Specifically, we focus on
<italic>logic formulas</italic>
(
<italic>if-then rules</italic>
) whose attributes
<italic>W</italic>
are
<italic>k</italic>
-mers, and whose values
<italic>f</italic>
<sub>
<italic>X</italic>
</sub>
(
<italic>W</italic>
) are relative frequencies in a genome
<italic>X</italic>
, possibly corrected by expected counts. An example of such a formula could be:
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\text{if}\,\,(f(\texttt{ACGT})>0.15) \wedge (f(\texttt{GGCT})<0.6)\,\,\text{then}\,\,X \in \mathcal{T} $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mrow>
<mml:mtext>if</mml:mtext>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="monospace">ACGT</mml:mtext>
<mml:mo>)</mml:mo>
<mml:mo>></mml:mo>
<mml:mn>0.15</mml:mn>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mtext mathvariant="monospace">GGCT</mml:mtext>
<mml:mo>)</mml:mo>
<mml:mo><</mml:mo>
<mml:mn>0.6</mml:mn>
<mml:mo>)</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mtext>then</mml:mtext>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>X</mml:mi>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">T</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13040_2015_Article_73_TeX2GIF_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {T}$\end{document}</tex-math>
<mml:math id="M4">
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
is a taxonomic unit (for example,
<italic>E. coli</italic>
) at a given taxonomic rank (for example, at the species level). Similar to recent
<sc>DNA</sc>
barcoding efforts, such formulas approximate a unique signature of set
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {T}$\end{document}</tex-math>
<mml:math id="M6">
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
, but they work on entire genomes rather than on few specific genes, and they do not require
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {T}$\end{document}</tex-math>
<mml:math id="M8">
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
to be at the species level [
<xref ref-type="bibr" rid="CR44">44</xref>
,
<xref ref-type="bibr" rid="CR46">46</xref>
]. Contrary to
<italic>markers</italic>
[
<xref ref-type="bibr" rid="CR47">47</xref>
<xref ref-type="bibr" rid="CR49">49</xref>
], the
<italic>k</italic>
-mers in such formulas need not to be genes, they need not to be rare in the genomes they characterize, they need not to be absent from the genomes they do not characterize. Contrary to
<italic>discriminating substrings</italic>
(see e.g. [
<xref ref-type="bibr" rid="CR50">50</xref>
] and references therein), formulas can use multiple substrings to classify, and they can link frequencies with conjunctions and disjunctions.</p>
<p>In this paper, we experiment with four rule-based algorithms [
<xref ref-type="bibr" rid="CR51">51</xref>
] that extract classification models in the form of logic formulas and we compare them with other state-of-the-art classifiers, such as Support Vector Machines [
<xref ref-type="bibr" rid="CR58">58</xref>
,
<xref ref-type="bibr" rid="CR69">69</xref>
] and Nearest Neighbor [
<xref ref-type="bibr" rid="CR70">70</xref>
]. Surprisingly, it turns out that we can reliably classify genomes at multiple taxonomic levels using a limited number of formulas, each involving few, short
<italic>k</italic>
-mers. Moreover, standard noise filtration methods have minimum impact on classification performance, suggesting that noise is automatically dampened by the formula-extraction algorithms.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<p>In this section, we present the
<italic>Logic Alignment Free</italic>
(
<sc>LAF</sc>
) technique and software package. The aim of LAF is to classify biological sequences and assign them to their taxonomic unit with the aid of a supervised machine learning paradigm [
<xref ref-type="bibr" rid="CR51">51</xref>
] (see subsection
<xref rid="Sec4" ref-type="sec">
<italic>Supervised machine learning and rule-based classification algorithms</italic>
</xref>
for more details). LAF uses a feature vector representation of the biological sequences, and gives them as input to rule-based classification algorithms (for a detailed analysis of rule-based classification methods, see [
<xref ref-type="bibr" rid="CR52">52</xref>
]).</p>
<p>In [
<xref ref-type="bibr" rid="CR53">53</xref>
], LAF has been already successfully applied to the classification of selectively constrained
<sc>DNA</sc>
elements, which are not alignable and do not come from the same gene regions.</p>
<p>Conversely, here we present the method in detail, provide the scripts and the software, and describe its application to bacterial genomes. In the following subsections, we illustrate the feature vector representation technique, the rule-based classification algorithms, and their integration in the
<sc>LAF</sc>
framework.</p>
<sec id="Sec3">
<title>Representing the sequences as feature vectors with alignment-free methods</title>
<p>The most widespread alignment-free methods compute the frequencies of the substrings in the biological sequences, called
<italic>k</italic>
-mers (where
<italic>k</italic>
is the length of the substring). For each sequence, the substring frequencies are then represented in a vector, called frequency vector [
<xref ref-type="bibr" rid="CR12">12</xref>
,
<xref ref-type="bibr" rid="CR54">54</xref>
<xref ref-type="bibr" rid="CR57">57</xref>
]. Each element of this vector expresses the frequency of a given
<italic>k</italic>
-mer, computed by scanning a sliding window of length
<italic>k</italic>
over the sequence.</p>
<p>More formally [
<xref ref-type="bibr" rid="CR9">9</xref>
], let
<italic>S</italic>
be a sequence of
<italic>n</italic>
characters over an alphabet
<italic>Σ</italic>
, e.g.
<italic>Σ</italic>
={
<italic>A</italic>
,
<italic>C</italic>
,
<italic>G</italic>
,
<italic>T</italic>
}, and let
<italic>k</italic>
∈ [ 1…
<italic>n</italic>
]. If
<italic>K</italic>
is a generic substring of
<italic>S</italic>
of length
<italic>k</italic>
,
<italic>K</italic>
is called a
<italic>k</italic>
-mer. Let the set
<italic>V</italic>
={
<italic>K</italic>
<sub>1</sub>
,
<italic>K</italic>
<sub>2</sub>
,…,
<italic>K</italic>
<sub>
<italic>m</italic>
</sub>
} be all possible
<italic>k</italic>
-mers over
<italic>Σ</italic>
, and define
<italic>m</italic>
=|
<italic>Σ</italic>
|
<sup>
<italic>k</italic>
</sup>
to be the size of set
<italic>V</italic>
. The
<italic>k</italic>
-mers are computed by counting the occurrences of the substrings in
<italic>S</italic>
with a sliding window of length
<italic>k</italic>
over
<italic>S</italic>
, starting at position 1 and ending at position
<italic>n</italic>
<italic>k</italic>
+1. A vector
<italic>F</italic>
contains for each
<italic>k</italic>
-mer the corresponding counts
<italic>F</italic>
=
<italic>c</italic>
<sub>1</sub>
,
<italic>c</italic>
<sub>2</sub>
,…,
<italic>c</italic>
<sub>
<italic>m</italic>
</sub>
. The frequencies are then computed accordingly and stored in a vector
<italic>F</italic>
<sup></sup>
=
<italic>f</italic>
<sub>1</sub>
,
<italic>f</italic>
<sub>2</sub>
,…,
<italic>f</italic>
<sub>
<italic>m</italic>
</sub>
; for a
<italic>k</italic>
-mer
<italic>K</italic>
<sub>
<italic>i</italic>
</sub>
, the frequency is defined as
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$f_{i}=\frac {c_{i}}{n-k+1}$\end{document}</tex-math>
<mml:math id="M10">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo></mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
.</p>
<p>These numerical representations of the sequences allow the use of statistical and mathematical techniques; indeed, the most used approach for sequence comparisons in alignment-free vector representations are distance measures, such as the Euclidean distance and the
<italic>d</italic>
2 distance [
<xref ref-type="bibr" rid="CR9">9</xref>
]. While the authors of [
<xref ref-type="bibr" rid="CR56">56</xref>
] use feature vector representation in combination with supervised machine learning methods, specifically Support Vector Machines [
<xref ref-type="bibr" rid="CR69">69</xref>
] for biological and text sequences, here we propose to analyze the frequency vectors with rule-based supervised machine learning algorithms. The effectiveness of this technique is investigated and tested on bacterial sequences.</p>
</sec>
<sec id="Sec4">
<title>Supervised machine learning and rule-based classification algorithms</title>
<p>The aim of this step is to classify the biological sequences into their taxonomic unit. Once the sequences are represented in a vector space, it is possible to analyze them by adopting a supervised machine learning approach, sketched in the following.</p>
<p>Given a set
<italic>B</italic>
of biological sequences, each assigned to a taxon (
<italic>training set</italic>
), a classifier is trained with these sequences in order to compute a classification model that predicts the taxon of each sequence from the values of its vector space representation. An additional set of sequences with known taxa is used to evaluate whether the model computed on the training set is able to predict correctly the taxa (the latter is called
<italic>test set</italic>
). For assessing the performance of the classifier we adopt the accuracy measure (A), also called correct rate
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$A=\frac {c}{t}$\end{document}</tex-math>
<mml:math id="M12">
<mml:mi>A</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<italic>c</italic>
is the number of correct classified sequences in the test set and
<italic>t</italic>
is the number of total sequences in the test set.</p>
<p>We focus on a particular type of classification methods - rule-based classifiers - which express the classification model in propositional logic form (e.g.,
<italic>if-then rules</italic>
). Rule-based classifiers have the main advantage of being able to control their dimension (in this case, the number of
<italic>k</italic>
-mers used), they are easily interpretable, and can straight-forwardly be integrated with other contextual knowledge. Several rule-based classification methods are proposed in the literature; in
<sc>LAF</sc>
we adopt the following ones: Data Mining Big (DMB) [
<xref ref-type="bibr" rid="CR59">59</xref>
,
<xref ref-type="bibr" rid="CR60">60</xref>
], RIDOR [
<xref ref-type="bibr" rid="CR61">61</xref>
], PART [
<xref ref-type="bibr" rid="CR62">62</xref>
], and RIPPER [
<xref ref-type="bibr" rid="CR63">63</xref>
]. All these methods use distinct rule extraction approaches, but – as we will see later – perform very well on the analyzed data sets of bacterial sequences. We report a brief description of these methods in the following.</p>
<p>
<bold>Data Mining Big</bold>
(DMB) [
<xref ref-type="bibr" rid="CR60">60</xref>
,
<xref ref-type="bibr" rid="CR64">64</xref>
,
<xref ref-type="bibr" rid="CR65">65</xref>
] is a rule-based classification software designed for biomedical data. It adopts optimization models that are formulated and solved in order to deal with the different steps of the data mining process. Five main steps are performed by DMB:
<list list-type="order">
<list-item>
<p>discretization: conversion of numeric attributes into nominal (discrete);</p>
</list-item>
<list-item>
<p>discrete cluster analysis: samples that are similar in the discretized space are clustered and dimension-reduced accordingly;</p>
</list-item>
<list-item>
<p>feature selection: the most relevant attributes for classification purpose are selected;</p>
</list-item>
<list-item>
<p>rule extraction: small and effective rules are extracted from training data and verified on test data;</p>
</list-item>
<list-item>
<p>classification: the extracted rules are used to classify new samples.</p>
</list-item>
</list>
</p>
<p>
<bold>RIDOR</bold>
[
<xref ref-type="bibr" rid="CR61">61</xref>
] performs rule extraction directly from the training data set. The first step is the computation of a default rule for the most frequent class (e.g., “all sequences are E. coli”). Then, it computes exception rules that represent the other classes (e.g., “except if
<italic>f</italic>
<italic>r</italic>
<italic>e</italic>
<italic>q</italic>
(
<italic>A</italic>
<italic>C</italic>
<italic>G</italic>
<italic>T</italic>
)<0.45 then the sequences are S. aureus”).</p>
<p>
<bold>PART</bold>
[
<xref ref-type="bibr" rid="CR62">62</xref>
] performs rules extraction with an indirect method. It uses the C4.5 decision tree based classification algorithm [
<xref ref-type="bibr" rid="CR66">66</xref>
], which computes a pruned decision tree for a given number of iterations. The best performing tree in terms of classification performances is chosen by PART and converted to rules for every species.</p>
<p>
<bold>RIPPER</bold>
[
<xref ref-type="bibr" rid="CR63">63</xref>
] is a direct rule extraction method based on a pruning procedure, whose aim is to minimize the error on the training set; it performs the following steps: i) growth of the rules; ii) pruning of the rules; iii) optimization of the model; iv) selection of the model. In the first step, thanks to a greedy procedure, RIPPER extracts many classification rules. Then, the rules are simplified and optimized in step two and three, respectively. Finally, the best model (i.e., set of rules) is selected.</p>
</sec>
<sec id="Sec5">
<title>Logic Alignment Free (LAF) method</title>
<p>Rule-based classifiers have been successfully used in the analysis of aligned sequences, e.g., in [
<xref ref-type="bibr" rid="CR59">59</xref>
] and [
<xref ref-type="bibr" rid="CR60">60</xref>
], where the classification of biological sequences to their species is performed by considering only sequences from the same gene region. In this case the rule-extraction procedure identifies exact gene regions and nucleotide assignments that are specific to a species; an example of such a rule could be ‘’if pos354 = T of gene 16S then the sequence belongs to E. Coli”.</p>
<p>Here we test a method for classifying biological sequences without the strict requirement of overlapping gene regions and of calculating an alignment, referred to as Logic Alignment Free (
<sc>LAF</sc>
). It is based on the frequency vector representation of the sequences. The method allows to classify non coding
<sc>DNA</sc>
that is not alignable [
<xref ref-type="bibr" rid="CR53">53</xref>
], and whole genomes, whose alignments are very computationally demanding.
<sc>LAF</sc>
adopts a supervised machine learning procedure, where a labeled training set of whole genomes is considered (labels in this case would be associated to the taxon).
<sc>LAF</sc>
would then operate with the following steps, if we take into account every genome
<italic>g</italic>
of the input data set:
<list list-type="bullet">
<list-item>
<p>The genome
<italic>g</italic>
is reverse complemented, the
<italic>k</italic>
-mers with
<italic>k</italic>
∈ [ 3…6] are counted and stored in a frequency vector
<italic>F</italic>
<sup></sup>
;</p>
</list-item>
<list-item>
<p>A matrix that contains all frequency vectors is created; the rows of the matrix are associated to the
<italic>k</italic>
-mers and the columns to the sequences (an example is given in Table
<xref rid="Tab1" ref-type="table">1</xref>
);
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Example of frequencies vectors matrix extracted by
<sc>LAF</sc>
and provided as input to rule-based classifiers</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">
<italic>S</italic>
<italic>e</italic>
<italic>q</italic>
<sub>1</sub>
</th>
<th align="left">
<italic>S</italic>
<italic>e</italic>
<italic>q</italic>
<sub>2</sub>
</th>
<th align="left"></th>
<th align="left">
<italic>S</italic>
<italic>e</italic>
<italic>q</italic>
<sub>
<italic>n</italic>
−1</sub>
</th>
<th align="left">
<italic>S</italic>
<italic>e</italic>
<italic>q</italic>
<sub>
<italic>n</italic>
</sub>
</th>
<th align="left"></th>
</tr>
<tr>
<th align="left"></th>
<th align="left">
<italic>E. Coli</italic>
</th>
<th align="left">
<italic>E. Coli</italic>
</th>
<th align="left"></th>
<th align="left">
<italic>S. Aureus</italic>
</th>
<th align="left">
<italic>S. Aureus</italic>
</th>
<th align="left"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">AAA</td>
<td align="left">0.46</td>
<td align="left">0.26</td>
<td align="left"></td>
<td align="left">0.24</td>
<td align="left">0.26</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">AAC</td>
<td align="left">0.12</td>
<td align="left">0.16</td>
<td align="left"></td>
<td align="left">0.23</td>
<td align="left">0.24</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">AAG</td>
<td align="left">0.13</td>
<td align="left">0.23</td>
<td align="left"></td>
<td align="left">0.23</td>
<td align="left">0.22</td>
<td align="left"></td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</list-item>
<list-item>
<p>The frequencies are discretized with the MDL procedure [
<xref ref-type="bibr" rid="CR67">67</xref>
] before applying RIDOR, PART and RIPPER, while DMB provides its own built-in discretization method;</p>
</list-item>
<list-item>
<p>A set of four rule-based classifiers (e.g., DMB, RIDOR, PART and RIPPER) take the matrix as input and extract the classification models and specimen to taxonomic unit assignments;</p>
</list-item>
<list-item>
<p>The above is repeated for different combinations of training / test sets.</p>
</list-item>
</list>
</p>
<p>For a compact overview of the method the reader may refer to the
<sc>LAF</sc>
flow chart drawn in Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
. To compute
<italic>k</italic>
-mer counts, we adopt the Jellyfish software [
<xref ref-type="bibr" rid="CR68">68</xref>
]. Data discretization is performed using MDL [
<xref ref-type="bibr" rid="CR67">67</xref>
] or the DMB internal procedure. As rule-based classifiers implementations we employ the Weka [
<xref ref-type="bibr" rid="CR67">67</xref>
] and the DMB packages. The
<sc>LAF</sc>
method is deployed in a software package available at
<ext-link ext-link-type="uri" xlink:href="http://dmb.iasi.cnr.it/laf.php">dmb.iasi.cnr.it/laf.php</ext-link>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Flow chart of the
<sc>LAF</sc>
method</p>
</caption>
<graphic xlink:href="13040_2015_73_Fig1_HTML" id="d30e1384"></graphic>
</fig>
</p>
</sec>
<sec id="Sec6">
<title>Data sets of bacterial genomes</title>
<p>In order to prove the validity of the
<sc>LAF</sc>
technique, we chose to test the method for the classification of biological sequences belonging to the bactria domain. We downloaded 1964 bacterial genomes from the NCBI genomes database (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/genome/browse/">www.ncbi.nlm.nih.gov/genome/browse/</ext-link>
). For every downloaded sequence, we query the NCBI taxonomy service (scripts are available at
<ext-link ext-link-type="uri" xlink:href="http://dmb.iasi.cnr.it/laf.php">dmb.iasi.cnr.it/laf.php</ext-link>
) to retrieve the full lineage, i.e., Species, Genus, Order, Class, Phylum. In order to perform an effective classification, we do not take into consideration under-represented species and therefore we filter out sequences with less than nine specimens. This step is necessary to perform a proper training of the classifiers. The final
<italic>filtered data set</italic>
is composed of 413 sequences with 25 species, 21 genera, 14 orders, 9 classes, and 6 phyla. Additionally, we also report the performances on the
<italic>original data set</italic>
(1964 bacterial genomes, 1157 species, 590 genera, 120 orders, 57 classes, and 36 phyla).</p>
</sec>
</sec>
<sec id="Sec7">
<title>Results and discussion</title>
<p>We apply
<sc>LAF</sc>
to the previously described
<italic>filtered data set</italic>
of bacterial genomes, setting
<italic>k</italic>
∈ [ 3…6] and using the four already mentioned rule-based classification algorithms by adopting a 10-fold cross validation sampling scheme. We show also the results on the
<italic>original data set</italic>
composed of 1964 sequences. Additionally, we compare the results of
<sc>LAF</sc>
with respect to the Support Vector Machine (SVM) classifier [
<xref ref-type="bibr" rid="CR69">69</xref>
] and with respect to a Nearest Neighbor approach [
<xref ref-type="bibr" rid="CR70">70</xref>
].</p>
<p>First, we test
<sc>LAF</sc>
on the filtered raw sequences without any preprocessing, obtaining very good classification performance. The accuracy of the classification algorithms for
<italic>k</italic>
=4 and multiple taxonomic levels is summarized in Table
<xref rid="Tab2" ref-type="table">2</xref>
. We focus on
<italic>k</italic>
=4 here since it is the smallest value to achieve good classification performances: increasing
<italic>k</italic>
slightly improves classification performances, but also complexity and computational time. We justify the choice of
<italic>k</italic>
=4 providing experimental evidence in Table
<xref rid="Tab3" ref-type="table">3</xref>
by focusing on the order level since similar performance is obtained at other levels. We can see that the classification accuracy only slightly increases by raising the value of
<italic>k</italic>
, but complexity and computational time significantly do. We provide also an example in Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
that shows the accuracy and computational time of RIPPER with respect to increasing values of
<italic>k</italic>
. The
<italic>k</italic>
-mers extraction is linear in the size of the input, but it is worth noting that for greater values of
<italic>k</italic>
the required IO bandwidth and the size of the data matrices exponentially increase [
<xref ref-type="bibr" rid="CR68">68</xref>
], slowing down the
<italic>k</italic>
-mers extraction and the classification processes. Additionally, the value of
<italic>k</italic>
=4 resonates with a number of previous studies [
<xref ref-type="bibr" rid="CR71">71</xref>
<xref ref-type="bibr" rid="CR73">73</xref>
].
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Accuracy and computational times of RIPPER with respect to increasing values of
<italic>k</italic>
on the original data set</p>
</caption>
<graphic xlink:href="13040_2015_73_Fig2_HTML" id="d30e1503"></graphic>
</fig>
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Percent accuracy of the rule-based classifiers for each taxonomic unit (10-fold cross validation) on the filtered data set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Level</th>
<th align="left">RIPPER</th>
<th align="left">RIDOR</th>
<th align="left">PART</th>
<th align="left">DMB</th>
<th align="left">Avg ±std.dev</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Species</td>
<td align="left">93.21</td>
<td align="left">97.33</td>
<td align="left">96.36</td>
<td align="left">
<bold>97.61</bold>
</td>
<td align="left">96.13 ±2.0</td>
</tr>
<tr>
<td align="left">Genus</td>
<td align="left">93.98</td>
<td align="left">
<bold>98.79</bold>
</td>
<td align="left">97.10</td>
<td align="left">98.44</td>
<td align="left">97.08 ±2.2</td>
</tr>
<tr>
<td align="left">Order</td>
<td align="left">98.79</td>
<td align="left">
<bold>99.27</bold>
</td>
<td align="left">98.31</td>
<td align="left">98.58</td>
<td align="left">98.74 ±0.4</td>
</tr>
<tr>
<td align="left">Class</td>
<td align="left">96.50</td>
<td align="left">97.81</td>
<td align="left">
<bold>98.79</bold>
</td>
<td align="left">97.06</td>
<td align="left">97.79 ±0.9</td>
</tr>
<tr>
<td align="left">Phylum</td>
<td align="left">96.88</td>
<td align="left">
<bold>98.78</bold>
</td>
<td align="left">98.07</td>
<td align="left">98.53</td>
<td align="left">98.06 ±0.8</td>
</tr>
<tr>
<td align="left">
<bold>Avg</bold>
±
<bold>std.dev</bold>
</td>
<td align="left">95.87 ±2.2</td>
<td align="left">
<bold>98.40</bold>
±0.8</td>
<td align="left">97.72 ±1.0</td>
<td align="left">98.24 ±0.4</td>
<td align="left">97.55 ±1.0</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best performances are highlighted in bold for each taxon</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Accuracy (ACC) [%] and computational times (T) [sec] on the order level with different values of
<italic>K</italic>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Data set</th>
<th align="left">Classifier</th>
<th align="left" colspan="2">K=3</th>
<th align="left" colspan="2">K=4</th>
<th align="left" colspan="2">K=5</th>
<th align="left" colspan="2">K=6</th>
</tr>
<tr>
<th align="left"></th>
<th align="left"></th>
<th align="left">ACC</th>
<th align="left">T</th>
<th align="left">ACC</th>
<th align="left">T [s]</th>
<th align="left">ACC</th>
<th align="left">T</th>
<th align="left">ACC</th>
<th align="left">T</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Original</td>
<td align="left">RIPPER</td>
<td align="left">64.50</td>
<td align="left">
<italic>37.08</italic>
</td>
<td align="left">69.82</td>
<td align="left">83.53</td>
<td align="left">69.76</td>
<td align="left">203.53</td>
<td align="left">69.92</td>
<td align="left">765.34</td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">RIDOR</td>
<td align="left">61.63</td>
<td align="left">
<italic>71.17</italic>
</td>
<td align="left">62.25</td>
<td align="left">320.72</td>
<td align="left">64.19</td>
<td align="left">1509.75</td>
<td align="left">64.75</td>
<td align="left">10320.40</td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">PART</td>
<td align="left">65.37</td>
<td align="left">
<italic>12.67</italic>
</td>
<td align="left">67.05</td>
<td align="left">24.58</td>
<td align="left">67.77</td>
<td align="left">70.13</td>
<td align="left">70.02</td>
<td align="left">280.23</td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">SVM</td>
<td align="left">70.69</td>
<td align="left">
<italic>605.55</italic>
</td>
<td align="left">85.37</td>
<td align="left">937.32</td>
<td align="left">88.59</td>
<td align="left">1312.52</td>
<td align="left">89.56</td>
<td align="left">2020.60</td>
</tr>
<tr>
<td align="left">Original</td>
<td align="left">NN</td>
<td align="left">83.27</td>
<td align="left">
<italic>9.56</italic>
</td>
<td align="left">85.67</td>
<td align="left">12.13</td>
<td align="left">86.49</td>
<td align="left">19.34</td>
<td align="left">87.06</td>
<td align="left">114.48</td>
</tr>
<tr>
<td align="left">Filtered</td>
<td align="left">RIPPER</td>
<td align="left">98.79</td>
<td align="left">
<italic>0.82</italic>
</td>
<td align="left">98.79</td>
<td align="left">1.55</td>
<td align="left">99.27</td>
<td align="left">4.56</td>
<td align="left">98.79</td>
<td align="left">27.76</td>
</tr>
<tr>
<td align="left">Filtered</td>
<td align="left">RIDOR</td>
<td align="left">96.12</td>
<td align="left">
<italic>1.58</italic>
</td>
<td align="left">99.27</td>
<td align="left">3.05</td>
<td align="left">96.36</td>
<td align="left">26.16</td>
<td align="left">97.33</td>
<td align="left">34.31</td>
</tr>
<tr>
<td align="left">Filtered</td>
<td align="left">PART</td>
<td align="left">97.34</td>
<td align="left">
<italic>0.51</italic>
</td>
<td align="left">98.31</td>
<td align="left">1.00</td>
<td align="left">97.58</td>
<td align="left">2.28</td>
<td align="left">97.09</td>
<td align="left">23.11</td>
</tr>
<tr>
<td align="left">Filtered</td>
<td align="left">SVM</td>
<td align="left">99.56</td>
<td align="left">
<italic>10.62</italic>
</td>
<td align="left">99.87</td>
<td align="left">11.58</td>
<td align="left">99.65</td>
<td align="left">13.10</td>
<td align="left">99.68</td>
<td align="left">14.71</td>
</tr>
<tr>
<td align="left">Filtered</td>
<td align="left">NN</td>
<td align="left">99.45</td>
<td align="left">
<italic>1.99</italic>
</td>
<td align="left">99.93</td>
<td align="left">3.30</td>
<td align="left">99.34</td>
<td align="left">3.70</td>
<td align="left">99.63</td>
<td align="left">4.18</td>
</tr>
<tr>
<td align="left">
<bold>Average</bold>
</td>
<td align="left">-</td>
<td align="left">83.67</td>
<td align="left">
<italic>75.2</italic>
</td>
<td align="left">86.63</td>
<td align="left">139.88</td>
<td align="left">86.90</td>
<td align="left">316.51</td>
<td align="left">87.38</td>
<td align="left">1360.51</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>In Table
<xref rid="Tab2" ref-type="table">2</xref>
, we report the average accuracy over all classification algorithms on the
<italic>filtered data set</italic>
. We note that the best results (98 % accuracy) are obtained for the phylum level – the highest in the taxonomy. Accuracy remains greater than 96 % at lower levels as well. According to the average over all taxonomic levels, RIDOR exhibits the best performance.</p>
<p>Moreover, we compare
<sc>LAF</sc>
with respect to the Support Vector Machine (SVM) classifier. We adopt the Weka implementation of SVM (called SMO) with a linear kernel and a soft margin. We obtain an accuracy of 99 % on the filtered data sets with a 10-fold cross validation sampling scheme, which slightly outperforms
<sc>LAF</sc>
. But we remark that SVM outputs just a single classification model that cannot be easily interpreted by human experts.</p>
<p>Finally, we evaluate also the performances of the
<italic>Nearest Neighbour</italic>
(NN) classifier by using the Weka implementation of NN (called IBk) and by setting the number of neighbours to 1, the NN search algorithm to linear, and by adopting the Euclidean distance. Also in this case we obtain an accuracy of 99 % on all filtered data sets with a 10-fold cross validation sampling scheme, but no human readable classification model.</p>
<p>Conversely to NN and SVM, the rule-based classification methods adopted by
<sc>LAF</sc>
provide sets of similar rules than can be analyzed, compared, and evaluated by the user. Here we consider as a sample the rules at the species level extracted by DMB, reported in Table
<xref rid="Tab4" ref-type="table">4</xref>
. A representative example of such family of rules is the one for
<italic>Helicobacter pylori</italic>
: ”if 5.56≤
<italic>f</italic>
(GTAC)<42.82 then the sample is
<italic>Helicobacter pylori</italic>
”. Here
<italic>f</italic>
(
<italic>K</italic>
) is the frequency of substring
<italic>K</italic>
(for readability, the frequency values are multiplied by 10
<sup>5</sup>
).
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>A sample of classification rules at the species level extracted by the DMB software.
<italic>f</italic>
(
<italic>W</italic>
) represents the relative frequency of substring
<italic>W</italic>
in a genome, multiplied by 10
<sup>5</sup>
for readability</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left">A. baumannii</td>
<td align="left">
<italic>f</italic>
(GTAC)≥229.10∧
<italic>f</italic>
(TGCA)≥515.63</td>
</tr>
<tr>
<td align="left">B. cereus</td>
<td align="left">384.04≤
<italic>f</italic>
(CTCA)<490.11∧819.04≤
<italic>f</italic>
(TCCA)<875.80</td>
</tr>
<tr>
<td align="left">B. animalis</td>
<td align="left">762.28≤
<italic>f</italic>
(TCCA)<819.04∧469.35≤
<italic>f</italic>
(TGCA)<515.63</td>
</tr>
<tr>
<td align="left">B. longum</td>
<td align="left">
<italic>f</italic>
(GTAC)≥229.10∧330.52≤
<italic>f</italic>
(TGCA)<376.80</td>
</tr>
<tr>
<td align="left">B. aphidicola</td>
<td align="left">57.77≤
<italic>f</italic>
(AGGC)<182.81</td>
</tr>
<tr>
<td align="left">C. jejuni</td>
<td align="left">490.11≤
<italic>f</italic>
(CTCA)<596.17∧353.97≤
<italic>f</italic>
(CTGA)<451.85</td>
</tr>
<tr>
<td align="left">C. trachomatis</td>
<td align="left">305.55≤
<italic>f</italic>
(GGAC)<393.10∧875.80≤
<italic>f</italic>
(TCCA)<932.56</td>
</tr>
<tr>
<td align="left">C. botulinum</td>
<td align="left">371.77≤
<italic>f</italic>
(ACTC)<434.37∧112.00≤
<italic>f</italic>
(GCAC)<261.71</td>
</tr>
<tr>
<td align="left">C. diphtheriae</td>
<td align="left">819.04≤
<italic>f</italic>
(TCCA)<875.80∧423.07≤
<italic>f</italic>
(TGCA)<469.35</td>
</tr>
<tr>
<td align="left">C. pseudotuberculosis</td>
<td align="left">875.80≤
<italic>f</italic>
(TCCA)<932.56∧423.07≤
<italic>f</italic>
(TGCA)<469.35</td>
</tr>
<tr>
<td align="left">E. coli</td>
<td align="left">710.86≤
<italic>f</italic>
(GCAC)<860.58∧415.84≤
<italic>f</italic>
(GCTA)<525.98</td>
</tr>
<tr>
<td align="left">F. tularensis</td>
<td align="left">592.00≤
<italic>f</italic>
(TCCA)<648.76∧330.52≤
<italic>f</italic>
(TGCA)<376.80</td>
</tr>
<tr>
<td align="left">H. influenzae</td>
<td align="left">549.73≤
<italic>f</italic>
(CTGA)<647.60∧130.47≤
<italic>f</italic>
(GGAC)<218.01</td>
</tr>
<tr>
<td align="left">H. pylori</td>
<td align="left">5.56≤
<italic>f</italic>
(GTAC)<42.82</td>
</tr>
<tr>
<td align="left">L. monocytogenes</td>
<td align="left">411.43≤
<italic>f</italic>
(GCAC)<561.15∧305.55≤
<italic>f</italic>
(GGAC)<393.10</td>
</tr>
<tr>
<td align="left">M. tuberculosis</td>
<td align="left">649.71≤
<italic>f</italic>
(ATCA)<772.78</td>
</tr>
<tr>
<td align="left">N. meningitidis</td>
<td align="left">590.29≤
<italic>f</italic>
(GATA)<754.27∧376.80≤
<italic>f</italic>
(TGCA)<423.07</td>
</tr>
<tr>
<td align="left">P. marinus</td>
<td align="left">(
<italic>f</italic>
(AGGA)<602.46∨
<italic>f</italic>
(AGGA)≥706.28)∧
<italic>f</italic>
(GCTA)<856.37</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">∧117.33≤
<italic>f</italic>
(GTAC)<154.58</td>
</tr>
<tr>
<td align="left">S. enterica</td>
<td align="left">525.98≤
<italic>f</italic>
(GCTA)<636.11∧393.10≤
<italic>f</italic>
(GGAC)<480.64</td>
</tr>
<tr>
<td align="left">S. aureus</td>
<td align="left">1082.23≤
<italic>f</italic>
(GATA)<1246.22∧
<italic>f</italic>
(GTAC)≥229.10</td>
</tr>
<tr>
<td align="left">S. pneumoniae</td>
<td align="left">393.10≤
<italic>f</italic>
(GGAC)<480.64∧154.58≤
<italic>f</italic>
(GTAC)<191.84</td>
</tr>
<tr>
<td align="left">S. pyogenes</td>
<td align="left">596.06≤
<italic>f</italic>
(AGTA)<733.86∧1082.23≤
<italic>f</italic>
(GATA)<1246.22</td>
</tr>
<tr>
<td align="left">S. suis</td>
<td align="left">918.25≤
<italic>f</italic>
(GATA)<1082.23∧330.52≤
<italic>f</italic>
(TGCA)<376.80</td>
</tr>
<tr>
<td align="left">S. islandicus</td>
<td align="left">218.01≤
<italic>f</italic>
(GGAC)<305.55∧284.24≤
<italic>f</italic>
(TGCA)<330.52</td>
</tr>
<tr>
<td align="left">Y. pestis</td>
<td align="left">596.17≤
<italic>f</italic>
(CTCA)<702.24∧
<italic>f</italic>
(CTGA)≥941.24</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>We observe that the same 4-mer is able to distinguish 3 and 2 bacterial species with different frequency values, respectively, and that twenty 4-mers suffice to separate all the 25 species. The classification rules are also very concise, since most of them are composed only by the conjuction of the conditions on two 4-mers (in the logic jargon, such rules are conjunctive clauses composed of two literals). In general, the rules computed for distinct species do not seem to use disjoint, species-specific sets of
<italic>k</italic>
-mers, suggesting that discrimination critically depends on the frequency of a
<italic>k</italic>
-mer rather than on its simple presence or absence in a species. Additional considerations derive from the granularity of the adopted discretization. The method allows to specify up-front the number of intervals used to discretize the frequency values of each
<italic>k</italic>
-mer, and then searches for an optimal discretization under this condition. From the experimental results we conclude that the number of intervals in which frequencies are discretized has minimal effects on classification quality, provided that at least 3 intervals are used (results not reported).</p>
<p>Moreover, we show the results on the
<italic>original data set</italic>
of all rule-based algorithms and compare them with SVM and NN in Table
<xref rid="Tab5" ref-type="table">5</xref>
. It is worth noting that the methods are not able to classify the bacteria genomes at species level, because of under representation (i.e., there are many species with just one or two sequences). At higher taxonomic levels (class and phylum) we obtain more reliable results. We highlight that SVM and NN perform best, but they do not provide a human readable classification model as rule-based classifiers, which permit to identify the different taxon specific
<italic>k</italic>
-mers.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Percent accuracy of the classifiers for each taxonomic unit (10-fold cross validation) on the original data set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Level</th>
<th align="left">RIPPER</th>
<th align="left">RIDOR</th>
<th align="left">PART</th>
<th align="left">DMB</th>
<th align="left">SVM</th>
<th align="left">NN</th>
<th align="left">Avg ±std.dev</th>
<th align="left"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Species</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Genus</td>
<td align="left">54.17</td>
<td align="left">47.67</td>
<td align="left">50.17</td>
<td align="left">48.54</td>
<td align="left">-</td>
<td align="left">
<bold>73.04</bold>
</td>
<td align="left">45.60 ±24.2</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Order</td>
<td align="left">69.82</td>
<td align="left">62.25</td>
<td align="left">67.05</td>
<td align="left">63.78</td>
<td align="left">85.37</td>
<td align="left">
<bold>85.68</bold>
</td>
<td align="left">72.32 ±10.5</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Class</td>
<td align="left">75.08</td>
<td align="left">69.92</td>
<td align="left">71.76</td>
<td align="left">72.05</td>
<td align="left">88.43</td>
<td align="left">
<bold>89.10</bold>
</td>
<td align="left">77.72 ±8.7</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Phylum</td>
<td align="left">75.85</td>
<td align="left">70.99</td>
<td align="left">56.77</td>
<td align="left">71.45</td>
<td align="left">85.93</td>
<td align="left">
<bold>86.08</bold>
</td>
<td align="left">74.51 ±8.2</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">
<bold>Avg</bold>
±
<bold>std.dev</bold>
</td>
<td align="left">68.73 ±10.0</td>
<td align="left">62.71 ±10.7</td>
<td align="left">61.44 ±9.7</td>
<td align="left">63.96 ±11</td>
<td align="left">64.93 ±43.3</td>
<td align="left">
<bold>83.48</bold>
±7.1</td>
<td align="left">67.54 ±14.8</td>
<td align="left"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best performances are highlighted in bold for each taxon</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>In order to test their effect on the classification performance, we applied different types of preprocessing to the filtered data set suggested in previous works [
<xref ref-type="bibr" rid="CR74">74</xref>
<xref ref-type="bibr" rid="CR77">77</xref>
] about phylogenetic reconstructions of genomes with alignment-free algorithms.
<list list-type="bullet">
<list-item>
<p>The first type consists in excluding all high-frequency and low-complexity substrings [
<xref ref-type="bibr" rid="CR74">74</xref>
] of a genome from its
<italic>k</italic>
-mer counts, using the DUST software implementation provided by NCBI [
<xref ref-type="bibr" rid="CR78">78</xref>
];</p>
</list-item>
<list-item>
<p>A second type of preprocessing consists in replacing the frequency
<italic>f</italic>
<sub>
<italic>T</italic>
</sub>
(
<italic>W</italic>
) of a
<italic>k</italic>
-mer
<italic>W</italic>
in a string
<italic>T</italic>
with a measure of the
<italic>statistical significance</italic>
of the event that
<italic>W</italic>
has
<italic>f</italic>
<sub>
<italic>T</italic>
</sub>
(
<italic>W</italic>
) occurrences in
<italic>T</italic>
. Specifically, we assigned to a
<italic>k</italic>
-mer
<italic>W</italic>
the score
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$z_{T}(W) = \left (p_{T}(W)-\tilde {p}_{T}(W)\right)\!/\tilde {p}_{T}(W)$\end{document}</tex-math>
<mml:math id="M14">
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<italic>p</italic>
<sub>
<italic>T</italic>
</sub>
(
<italic>W</italic>
)=
<italic>f</italic>
<sub>
<italic>T</italic>
</sub>
(
<italic>W</italic>
)/(|
<italic>T</italic>
|−
<italic>k</italic>
+1), and where
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {p}_{T}(W) = p_{T}(W[\!1..k-1]) \cdot p_{T}(W[\!2..k]) / p_{T}(W[\!2..k-1])$\end{document}</tex-math>
<mml:math id="M16">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>1</mml:mn>
<mml:mi>..k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mi>..k</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mi>..k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the expected value of
<italic>p</italic>
<sub>
<italic>T</italic>
</sub>
(
<italic>W</italic>
) under the assumption that
<italic>T</italic>
was generated by a Markov process of order
<italic>k</italic>
−2 or smaller. This score has been shown to be critical in building accurate phylogenies of distantly-related prokaryotes [
<xref ref-type="bibr" rid="CR75">75</xref>
];</p>
</list-item>
<list-item>
<p>We experimented with the estimator
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {p}_{T}(W) = \left (f_{T}(W[\!1]) \cdot f_{T}(W[\!2..k]) + f_{T}(W[\!1..k-1]) \cdot f_{T}(W[\!k]) \right)/2$\end{document}</tex-math>
<mml:math id="M18">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>2</mml:mn>
<mml:mi>..k</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mn>1</mml:mn>
<mml:mi>..k</mml:mi>
<mml:mo></mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>[</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mi>k</mml:mi>
<mml:mo>]</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfenced>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
, derived under the assumption that
<italic>W</italic>
[ 2..
<italic>k</italic>
−1],
<italic>W</italic>
[ 1] and
<italic>W</italic>
[
<italic>k</italic>
] occur independently in
<italic>T</italic>
[
<xref ref-type="bibr" rid="CR76">76</xref>
];</p>
</list-item>
<list-item>
<p>We also adopted an even simpler estimator, based on single-nucleotide frequencies (see [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR77">77</xref>
] and references therein for alternative ways to compute
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {p}_{T}(W)$\end{document}</tex-math>
<mml:math id="M20">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>~</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
).</p>
</list-item>
</list>
</p>
<p>In our experiments, none of these preprocessing methods yielded a visible improvement on classification quality, suggesting that noise is automatically dampened by the formula-extraction algorithms run on raw frequencies. Nonetheless, we include in our
<sc>LAF</sc>
package an implementation of all such filters, since they could be useful in other data sets.</p>
</sec>
<sec id="Sec8">
<title>Conclusions and future work</title>
<p>The
<sc>LAF</sc>
method combines
<italic>k</italic>
-mer composition vectors and rule-based classification algorithms to classify biological sequences. Such sequences do not need to be aligned or to belong to the same gene. The method was applied to bacterial whole genomes, and it was able to perform with accurate classification results and to identify common subsequences (
<italic>k</italic>
-mers) in each taxon (class) of the data set.</p>
<p>We compared our method with other state-of-the art classification methods and provided experimental results that show promising performance of
<sc>LAF</sc>
in particular in the classification model extraction (i.e., specific
<italic>k</italic>
-mers for each taxon).</p>
<p>Several directions for future research stem from the results obtained in this paper: further reducing the size of the classification models, analyzing more deeply the
<italic>k</italic>
-mers selected by our models; and measuring how classification performance degenerates when moving from whole genomes to short fragments.</p>
<p>Another possible way to further reduce the size of our models consists in building
<italic>hierarchical</italic>
classification rules by extracting logic formulas that best discriminate between elements in a taxonomic unit
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {T}$\end{document}</tex-math>
<mml:math id="M22">
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
and elements in
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\text {\texttt {parent}}(\mathcal {T}) \backslash \mathcal {T}$\end{document}</tex-math>
<mml:math id="M24">
<mml:mtext>parent</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="script">T</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo></mml:mo>
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
, where
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\text {\texttt {parent}}(\mathcal {T})$\end{document}</tex-math>
<mml:math id="M26">
<mml:mtext>parent</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="script">T</mml:mi>
<mml:mo>)</mml:mo>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the parent of
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\mathcal {T}$\end{document}</tex-math>
<mml:math id="M28">
<mml:mi mathvariant="script">T</mml:mi>
</mml:math>
<inline-graphic xlink:href="13040_2015_Article_73_TeX2GIF_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
in the taxonomic tree. Such result would look very similar to a decision tree, and the corresponding
<italic>k</italic>
-mers could be related to the notion of
<italic>crowns</italic>
(see [
<xref ref-type="bibr" rid="CR79">79</xref>
]).</p>
<p>Analyzing the actual
<italic>k</italic>
-mers selected by our models is another obvious open direction, for example in terms of syntactic similarity and positional correlations between the
<italic>k</italic>
-mers that appear in the same formula, or in terms of enrichment of such
<italic>k</italic>
-mers in regulatory regions or in gene families devoted to specific cellular processes.</p>
<p>It is also of interest the understanding of how the classification performance degenerates when moving from whole genomes to short fragments, for example by determining how small a fragment we can classify correctly using the formulas learned from entire genomes, or using new formulas learned from fragments.
<italic>Abundance estimation</italic>
in metagenomic samples is also a natural application for the strong biases in the relative frequency of
<italic>k</italic>
-mers that we report here: given a set of observed
<italic>k</italic>
-mer frequencies in a sample, and a set of logic rules in sequenced genomes, the problem would then amount to compute the most probable abundance of known species in the sample.</p>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>EW designed and implemented the method, planned and executed the experiments, and wrote the paper. FC inspired the research, contributed to the design of the method, suggested the statistical corrections, and wrote the paper. GF directed research, contributed to the design of the experiments, and wrote the paper. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors are grateful to the organizing committee of the 5th Biological Discovery Workshop (Biokdd 2014) for inviting them to write and publish the manuscript in Biodata Mining. The authors would like to thank Giulia Fiscon for the precious advices and for revising the paper and prof. Paola Bertolazzi for providing a stimulating research environment and fruitful scientific discussions. This paper is dedicated to prof. Alberto Apostolico. The authors have been supported by the Italian PRIN “GenData 2020” (2010RTFWBH), the FLAGSHIP “InterOmics” project (PB.P05), and by Academy of Finland under grant 250345 (Center of Excellence in Cancer Genetics Research).</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
</person-group>
<article-title>Searching protein sequence libraries: comparison of the sensitivity and selectivity of the smith-waterman and fasta algorithms</article-title>
<source>Genomics</source>
<year>1991</year>
<volume>11</volume>
<issue>3</issue>
<fpage>635</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="doi">10.1016/0888-7543(91)90071-L</pub-id>
<pub-id pub-id-type="pmid">1774068</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Needleman</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Wunsch</surname>
<given-names>CD</given-names>
</name>
</person-group>
<article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>
<source>J Mol Biol</source>
<year>1970</year>
<volume>48</volume>
<issue>3</issue>
<fpage>443</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(70)90057-4</pub-id>
<pub-id pub-id-type="pmid">5420325</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
</person-group>
<article-title>Rapid and sensitive sequence comparison with fastp and fasta</article-title>
<source>Methods Enzymol</source>
<year>1990</year>
<volume>183</volume>
<fpage>63</fpage>
<lpage>98</lpage>
<pub-id pub-id-type="doi">10.1016/0076-6879(90)83007-V</pub-id>
<pub-id pub-id-type="pmid">2156132</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schaffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Gapped blast and psi-blast: a new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<year>1997</year>
<volume>25</volume>
<issue>17</issue>
<fpage>3389</fpage>
<lpage>402</lpage>
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pub-id pub-id-type="pmid">9254694</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Muscle: multiple sequence alignment with high accuracy and high throughput</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>5</issue>
<fpage>1792</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh340</pub-id>
<pub-id pub-id-type="pmid">15034147</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thompson</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Gibson</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Higgins</surname>
<given-names>DG</given-names>
</name>
</person-group>
<article-title>Multiple sequence alignment using clustalw and clustalx</article-title>
<source>Curr Protocol Bioinformatics</source>
<year>2002</year>
<volume>00</volume>
<fpage>2.3.1</fpage>
<lpage>2.3.22</lpage>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Mokaddem</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Elloumi</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Motalign: A multiple sequence alignment algorithm based on a new distance and a new score function</article-title>
<source>DEXA Workshops</source>
<year>2013</year>
<publisher-loc>Los Alamitos, CA, USA</publisher-loc>
<publisher-name>IEEE Computer Society</publisher-name>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Katoh</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Misawa</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kuma</surname>
<given-names>K-i</given-names>
</name>
<name>
<surname>Miyata</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>14</issue>
<fpage>3059</fpage>
<lpage>66</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkf436</pub-id>
<pub-id pub-id-type="pmid">12136088</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison-a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Kasif</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Fleischmann</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>White</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Alignment of whole genomes</article-title>
<source>Nucleic Acids Res</source>
<year>1999</year>
<volume>27</volume>
<issue>11</issue>
<fpage>2369</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="doi">10.1093/nar/27.11.2369</pub-id>
<pub-id pub-id-type="pmid">10325427</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vitnyi</surname>
<given-names>PMB</given-names>
</name>
</person-group>
<source>An Introduction to Kolmogorov Complexity and Its Applications</source>
<year>2008</year>
<publisher-loc>New York, USA</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Almeida</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Universal sequence map (usm) of arbitrary discrete sequences</article-title>
<source>BMC Bioinformatics</source>
<year>2002</year>
<volume>3</volume>
<fpage>6</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-3-6</pub-id>
<pub-id pub-id-type="pmid">11895567</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Pham</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Crane</surname>
<given-names>DI</given-names>
</name>
</person-group>
<article-title>Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification</article-title>
<source>Advanced Computational Methods for Biocomputing and Bioimaging</source>
<year>2007</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Nova Science Publishers</publisher-name>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison – a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bentley</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Parkhill</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Comparative genomic structure of prokaryotes</article-title>
<source>Annu Rev Genet</source>
<year>2004</year>
<volume>38</volume>
<fpage>771</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.genet.38.072902.094318</pub-id>
<pub-id pub-id-type="pmid">15568993</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Josse</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kaiser</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kornberg</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Enzymatic synthesis of deoxyribonucleic acid</article-title>
<source>J Biol Chem</source>
<year>1961</year>
<volume>236</volume>
<fpage>864</fpage>
<lpage>75</lpage>
<pub-id pub-id-type="pmid">13790780</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trautner</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Swartz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kornberg</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Enzymatic synthesis of deoxyribonucleic acid, x. influence of bromouracil substitutions on replication</article-title>
<source>Proc Natl Acad Sci U S A.</source>
<year>1962</year>
<volume>48</volume>
<issue>3</issue>
<fpage>449</fpage>
<pub-id pub-id-type="doi">10.1073/pnas.48.3.449</pub-id>
<pub-id pub-id-type="pmid">13922323</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russell</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Elton</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Subak-Sharpe</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Doublet frequency analysis of fractionated vertebrate nuclear DNA</article-title>
<source>J Mol Biol</source>
<year>1976</year>
<volume>108</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(76)80090-3</pub-id>
<pub-id pub-id-type="pmid">1003479</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russell</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Subak-Sharpe</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Similarity of the general designs of protochordates and invertebrates</article-title>
<source>Nature</source>
<year>1977</year>
<volume>266</volume>
<issue>5602</issue>
<fpage>533</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1038/266533a0</pub-id>
<pub-id pub-id-type="pmid">558523</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karlin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Burge</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Dinucleotide relative abundance extremes: a genomic signature</article-title>
<source>Trends Genet</source>
<year>1995</year>
<volume>11</volume>
<issue>7</issue>
<fpage>283</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1016/S0168-9525(00)89076-9</pub-id>
<pub-id pub-id-type="pmid">7482779</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karlin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mrázek</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Compositional differences within and between eukaryotic genomes</article-title>
<source>Proc Natl Acad Sci</source>
<year>1997</year>
<volume>94</volume>
<issue>19</issue>
<fpage>10227</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.94.19.10227</pub-id>
<pub-id pub-id-type="pmid">9294192</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Meyerdierks</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Amann</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Glöckner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Application of tetranucleotide frequencies for the assignment of genomic fragments</article-title>
<source>Environ Microbiol</source>
<year>2004</year>
<volume>6</volume>
<issue>9</issue>
<fpage>938</fpage>
<lpage>47</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2004.00624.x</pub-id>
<pub-id pub-id-type="pmid">15305919</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Olman</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Barcodes for genomes and applications</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>546</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-546</pub-id>
<pub-id pub-id-type="pmid">19091119</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deschavanne</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Giron</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vilain</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fagot</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Fertil</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Genomic signature: characterization and classification of species assessed by chaos game representation of sequences</article-title>
<source>Mol Biol Evol</source>
<year>1999</year>
<volume>16</volume>
<issue>10</issue>
<fpage>1391</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a026048</pub-id>
<pub-id pub-id-type="pmid">10563018</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sandberg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Winberg</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bränden</surname>
<given-names>CI</given-names>
</name>
<name>
<surname>Kaske</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ernberg</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Cöster</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Capturing whole-genome characteristics in short sequences using a naive bayesian classifier</article-title>
<source>Genome Res</source>
<year>2001</year>
<volume>11</volume>
<issue>8</issue>
<fpage>1404</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1101/gr.186401</pub-id>
<pub-id pub-id-type="pmid">11483581</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Meinersmann</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Blaser</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<issue>2</issue>
<fpage>145</fpage>
<lpage>58</lpage>
<pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
<pub-id pub-id-type="pmid">12566393</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gatherer</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Genome signatures, self-organizing maps and higher order phylogenies: A parametric analysis</article-title>
<source>Evol Bioinformatics Online</source>
<year>2007</year>
<volume>3</volume>
<fpage>211</fpage>
</element-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Takahashi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kryukov</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Saitou</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Estimation of bacterial species phylogeny through oligonucleotide frequency distances</article-title>
<source>Genomics</source>
<year>2009</year>
<volume>93</volume>
<issue>6</issue>
<fpage>525</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2009.01.009</pub-id>
<pub-id pub-id-type="pmid">19442633</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waldmann</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lombardot</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Glockner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences</article-title>
<source>BMC Bioinformatics</source>
<year>2004</year>
<volume>5</volume>
<issue>1</issue>
<fpage>163</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-5-163</pub-id>
<pub-id pub-id-type="pmid">15507136</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rigoutsos</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Floratos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Parida</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins</article-title>
<source>Proteins</source>
<year>1999</year>
<volume>37</volume>
<issue>2</issue>
<fpage>264</fpage>
<lpage>77</lpage>
<pub-id pub-id-type="doi">10.1002/(SICI)1097-0134(19991101)37:2<264::AID-PROT11>3.0.CO;2-C</pub-id>
<pub-id pub-id-type="pmid">10584071</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chor</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Horn</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Massingham</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Genomic DNA
<italic>k</italic>
-mer spectra: models and modalities</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>10</issue>
<fpage>108</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-10-r108</pub-id>
<pub-id pub-id-type="pmid">19591645</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oğul</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Mumcuoğlu</surname>
<given-names></given-names>
</name>
</person-group>
<article-title>Svm-based detection of distant protein structural relationships using pairwise probabilistic suffix trees</article-title>
<source>Comput Biol Chem</source>
<year>2006</year>
<volume>30</volume>
<issue>4</issue>
<fpage>292</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1016/j.compbiolchem.2006.05.001</pub-id>
<pub-id pub-id-type="pmid">16880118</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karlin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mrazek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>Compositional biases of bacterial genomes and evolutionary implications</article-title>
<source>J Bacteriol</source>
<year>1997</year>
<volume>179</volume>
<issue>12</issue>
<fpage>3899</fpage>
<lpage>913</lpage>
<pub-id pub-id-type="doi">10.1128/jb.179.12.3899-3913.1997</pub-id>
<pub-id pub-id-type="pmid">9190805</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Foerstner</surname>
<given-names>KU</given-names>
</name>
<name>
<surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hooper</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Environments shape the nucleotide composition of genomes</article-title>
<source>EMBO Rep</source>
<year>2005</year>
<volume>6</volume>
<issue>12</issue>
<fpage>1208</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.1038/sj.embor.7400538</pub-id>
<pub-id pub-id-type="pmid">16200051</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McHardy</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Martín</surname>
<given-names>HG</given-names>
</name>
<name>
<surname>Tsirigos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rigoutsos</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Accurate phylogenetic classification of variable-length DNA fragments</article-title>
<source>Nat Methods</source>
<year>2007</year>
<volume>4</volume>
<issue>1</issue>
<fpage>63</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth976</pub-id>
<pub-id pub-id-type="pmid">17179938</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Chatterji</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yamazaki</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>Compostbin: A dna composition-based algorithm for binning environmental shotgun reads</article-title>
<source>Research in Computational Molecular Biology</source>
<year>2008</year>
<publisher-loc>Berlin</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leung</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1489</fpage>
<lpage>95</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr186</pub-id>
<pub-id pub-id-type="pmid">21493653</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FY</given-names>
</name>
</person-group>
<article-title>Metacluster 4.0: a novel binning algorithm for ngs reads and huge number of species</article-title>
<source>J Comput Biol</source>
<year>2012</year>
<volume>19</volume>
<issue>2</issue>
<fpage>241</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2011.0276</pub-id>
<pub-id pub-id-type="pmid">22300323</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Tanaseichuk</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Borneman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Separating metagenomic short reads into genomes via clustering</article-title>
<source>Algorithms in Bioinformatics</source>
<year>2011</year>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR40">
<label>40</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison based on next generation sequencing reads</article-title>
<source>Research in Computational Molecular Biology</source>
<year>2012</year>
<publisher-loc>Berlin</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stuart</surname>
<given-names>GW</given-names>
</name>
<name>
<surname>Moffett</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Integrated gene and species phylogenies from unaligned whole genome protein sequences</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>1</issue>
<fpage>100</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.1.100</pub-id>
<pub-id pub-id-type="pmid">11836217</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stuart</surname>
<given-names>GW</given-names>
</name>
<name>
<surname>Moffett</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Leader</surname>
<given-names>JJ</given-names>
</name>
</person-group>
<article-title>A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes</article-title>
<source>Mol Biol Evol</source>
<year>2002</year>
<volume>19</volume>
<issue>4</issue>
<fpage>554</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a004111</pub-id>
<pub-id pub-id-type="pmid">11919297</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Comin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Verzotto</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Whole-genome phylogeny by virtue of unic subwords</article-title>
<source>Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop On</source>
<year>2012</year>
<publisher-loc>Los Alamitos, CA, USA</publisher-loc>
<publisher-name>IEEE Computer Society</publisher-name>
</element-citation>
</ref>
<ref id="CR44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Efficient alignment-free DNA barcode analytics</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>Suppl. 14</issue>
<fpage>9</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S9</pub-id>
<pub-id pub-id-type="pmid">19128508</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Solovyev</surname>
<given-names>VV</given-names>
</name>
<name>
<surname>Makarova</surname>
<given-names>KS</given-names>
</name>
</person-group>
<article-title>A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization</article-title>
<source>Comput Appl Biosci: CABIOS</source>
<year>1993</year>
<volume>9</volume>
<issue>1</issue>
<fpage>17</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="pmid">8435763</pub-id>
</element-citation>
</ref>
<ref id="CR46">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ratnasingham</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hebert</surname>
<given-names>PDN</given-names>
</name>
</person-group>
<article-title>BOLD: the barcode of life data system</article-title>
<source>Mol Ecol Notes</source>
<year>2007</year>
<volume>7</volume>
<fpage>355</fpage>
<lpage>64</lpage>
<pub-id pub-id-type="doi">10.1111/j.1471-8286.2007.01678.x</pub-id>
<pub-id pub-id-type="pmid">18784790</pub-id>
</element-citation>
</ref>
<ref id="CR47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gibbons</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ghodsi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Treangen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences</article-title>
<source>BMC Genomics</source>
<year>2011</year>
<volume>12</volume>
<issue>Suppl 2</issue>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-12-S2-S4</pub-id>
<pub-id pub-id-type="pmid">21205322</pub-id>
</element-citation>
</ref>
<ref id="CR48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Waldron</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ballarini</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jousson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic microbial community profiling using unique clade-specific marker genes</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>8</issue>
<fpage>811</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2066</pub-id>
<pub-id pub-id-type="pmid">22688413</pub-id>
</element-citation>
</ref>
<ref id="CR49">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Disz</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Pusch</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Vonstein</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Real time metagenomics: Using k-mers to annotate metagenomes</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>24</issue>
<fpage>3316</fpage>
<lpage>17</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts599</pub-id>
<pub-id pub-id-type="pmid">23047562</pub-id>
</element-citation>
</ref>
<ref id="CR50">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seth</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Välimäki</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kaski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Honkela</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Exploration and retrieval of whole-metagenome sequencing samples</article-title>
<source>Bioinformatics</source>
<year>2014</year>
<volume>30</volume>
<issue>17</issue>
<fpage>2471</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu340</pub-id>
<pub-id pub-id-type="pmid">24845653</pub-id>
</element-citation>
</ref>
<ref id="CR51">
<label>51</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fiscon</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Supervised dna barcodes species classification: analysis, comparisons and results</article-title>
<source>BioData Mining</source>
<year>2014</year>
<volume>7</volume>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/1756-0381-7-4</pub-id>
<pub-id pub-id-type="pmid">24721333</pub-id>
</element-citation>
</ref>
<ref id="CR52">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lehr</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zeumer</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Jayadev</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ritchie</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Rule based classifier for the analysis of gene-gene and gene-environment interactions in genetic association studies</article-title>
<source>BioData Mining</source>
<year>2011</year>
<volume>4</volume>
<issue>1</issue>
<fpage>4</fpage>
<pub-id pub-id-type="doi">10.1186/1756-0381-4-4</pub-id>
<pub-id pub-id-type="pmid">21362183</pub-id>
</element-citation>
</ref>
<ref id="CR53">
<label>53</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Polychronopoulos</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Dimitrieva</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bucher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Almirantis</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Classification of selectively constrained dna elements using feature vectors and rule-based classifiers</article-title>
<source>Genomics</source>
<year>2014</year>
<volume>104</volume>
<issue>2</issue>
<fpage>79</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2014.07.004</pub-id>
<pub-id pub-id-type="pmid">25058025</pub-id>
</element-citation>
</ref>
<ref id="CR54">
<label>54</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kudenko</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hirsh</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Feature generation for sequence categorization</article-title>
<source>AAAI/IAAI</source>
<year>1998</year>
<publisher-loc>Cambridge, USA</publisher-loc>
<publisher-name>The MIT Press</publisher-name>
</element-citation>
</ref>
<ref id="CR55">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ben-Hur</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Remote homology detection: a motif based approach</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>suppl 1</issue>
<fpage>26</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg1002</pub-id>
</element-citation>
</ref>
<ref id="CR56">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xing</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Pei</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Keogh</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>A brief survey on sequence classification</article-title>
<source>ACM SIGKDD Explorations Newslett</source>
<year>2010</year>
<volume>12</volume>
<issue>1</issue>
<fpage>40</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1145/1882471.1882478</pub-id>
</element-citation>
</ref>
<ref id="CR57">
<label>57</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Efficient alignment-free dna barcode analytics</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10 Suppl 14</volume>
<fpage>9</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S9</pub-id>
<pub-id pub-id-type="pmid">19128508</pub-id>
</element-citation>
</ref>
<ref id="CR58">
<label>58</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Vapnik</surname>
<given-names>VN</given-names>
</name>
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
</person-group>
<source>Statistical Learning Theory</source>
<year>1998</year>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>Wiley</publisher-name>
</element-citation>
</ref>
<ref id="CR59">
<label>59</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bertolazzi</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Learning to classify species with barcodes</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>S-14</issue>
<fpage>7</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S7</pub-id>
<pub-id pub-id-type="pmid">19128472</pub-id>
</element-citation>
</ref>
<ref id="CR60">
<label>60</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lo Presti</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Drovandi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Ciccozzi</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ciotti</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Human polyomaviruses identification by logic mining techniques</article-title>
<source>BMC Virol J.</source>
<year>2012</year>
<volume>58</volume>
<issue>9</issue>
<fpage>1</fpage>
<lpage>6</lpage>
</element-citation>
</ref>
<ref id="CR61">
<label>61</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gaines</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Compton</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Induction of ripple-down rules applied to modeling large databases</article-title>
<source>J Intell Inf Syst.</source>
<year>1995</year>
<volume>5</volume>
<issue>3</issue>
<fpage>211</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="doi">10.1007/BF00962234</pub-id>
</element-citation>
</ref>
<ref id="CR62">
<label>62</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
</person-group>
<article-title>Generating accurate rule sets without global optimization</article-title>
<source>Proc. of the 15th Int. Conference on Machine Learning</source>
<year>1998</year>
<publisher-loc>San Francisco, CA, USA</publisher-loc>
<publisher-name>Morgan Kaufmann</publisher-name>
</element-citation>
</ref>
<ref id="CR63">
<label>63</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Cohen</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>Fast effective rule induction</article-title>
<source>Proceedings of the Twelfth International Conference on Machine Learning</source>
<year>1995</year>
<publisher-loc>San Francisco, CA, USA</publisher-loc>
<publisher-name>Morgan Kaufmann</publisher-name>
</element-citation>
</ref>
<ref id="CR64">
<label>64</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Truemper</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>A minsat approach for learning in logic domains</article-title>
<source>INFORMS J Comput</source>
<year>2002</year>
<volume>13</volume>
<issue>3</issue>
<fpage>1</fpage>
<lpage>17</lpage>
</element-citation>
</ref>
<ref id="CR65">
<label>65</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bertolazzi</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Learning to classify species with barcodes</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>S14</issue>
<fpage>7</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S7</pub-id>
<pub-id pub-id-type="pmid">19128472</pub-id>
</element-citation>
</ref>
<ref id="CR66">
<label>66</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quinlan</surname>
<given-names>JR</given-names>
</name>
</person-group>
<article-title>Improved use of continuous attributes in C4.5</article-title>
<source>J Artif Intell Res</source>
<year>1996</year>
<volume>4</volume>
<fpage>77</fpage>
<lpage>90</lpage>
</element-citation>
</ref>
<ref id="CR67">
<label>67</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hall</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pfahringer</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Reutemann</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
</person-group>
<article-title>The weka data mining software: an update</article-title>
<source>SIGKDD Explor Newsl</source>
<year>2009</year>
<volume>11</volume>
<issue>1</issue>
<fpage>10</fpage>
<lpage>18</lpage>
<pub-id pub-id-type="doi">10.1145/1656274.1656278</pub-id>
</element-citation>
</ref>
<ref id="CR68">
<label>68</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="CR69">
<label>69</label>
<mixed-citation publication-type="other">An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, UK: Cambridge University Press.</mixed-citation>
</ref>
<ref id="CR70">
<label>70</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Dasarathy</surname>
<given-names>BV</given-names>
</name>
</person-group>
<source>Nearest Neighbor NN Norms: NN Pattern Classification Techniques</source>
<year>1991</year>
<publisher-loc>Los Alamitos, CA, USA</publisher-loc>
<publisher-name>IEEE Computer Society Press</publisher-name>
</element-citation>
</ref>
<ref id="CR71">
<label>71</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Meyerdiekers</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Glockner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Application of tetranucleotide frequencies for the assignment of genomic fragments</article-title>
<source>Environ Microbiol</source>
<year>2004</year>
<volume>6</volume>
<issue>9</issue>
<fpage>938</fpage>
<lpage>47</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2004.00624.x</pub-id>
<pub-id pub-id-type="pmid">15305919</pub-id>
</element-citation>
</ref>
<ref id="CR72">
<label>72</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Meinersmann</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Blaser</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<fpage>145</fpage>
<lpage>58</lpage>
<pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
<pub-id pub-id-type="pmid">12566393</pub-id>
</element-citation>
</ref>
<ref id="CR73">
<label>73</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waldmann</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lombardot</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Glockner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences</article-title>
<source>BMC Bioinformatics</source>
<year>2004</year>
<volume>5</volume>
<fpage>163</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-5-163</pub-id>
<pub-id pub-id-type="pmid">15507136</pub-id>
</element-citation>
</ref>
<ref id="CR74">
<label>74</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chan</surname>
<given-names>RH</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Yeung</surname>
<given-names>HM</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>RW</given-names>
</name>
</person-group>
<article-title>Composition vector method based on maximum entropy principle for sequence comparison</article-title>
<source>Comput Biol Bioinform IEEE/ACM Trans</source>
<year>2012</year>
<volume>9</volume>
<issue>1</issue>
<fpage>79</fpage>
<lpage>87</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2011.45</pub-id>
</element-citation>
</ref>
<ref id="CR75">
<label>75</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hao</surname>
<given-names>BI</given-names>
</name>
</person-group>
<article-title>Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach</article-title>
<source>J Mol Evol</source>
<year>2004</year>
<volume>58</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>11</lpage>
<pub-id pub-id-type="doi">10.1007/s00239-003-2493-7</pub-id>
<pub-id pub-id-type="pmid">14743310</pub-id>
</element-citation>
</ref>
<ref id="CR76">
<label>76</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>ZG</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>LQ</given-names>
</name>
<name>
<surname>Anh</surname>
<given-names>VV</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Long</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>JQ</given-names>
</name>
</person-group>
<article-title>Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment</article-title>
<source>J Mol Evol</source>
<year>2005</year>
<volume>60</volume>
<issue>4</issue>
<fpage>538</fpage>
<lpage>45</lpage>
<pub-id pub-id-type="doi">10.1007/s00239-004-0255-9</pub-id>
<pub-id pub-id-type="pmid">15883888</pub-id>
</element-citation>
</ref>
<ref id="CR77">
<label>77</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Reinert</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing</article-title>
<source>Brief Bioinform</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>343</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbt067</pub-id>
<pub-id pub-id-type="pmid">24064230</pub-id>
</element-citation>
</ref>
<ref id="CR78">
<label>78</label>
<mixed-citation publication-type="other">Blast Package Version 2.2.25-7.
<ext-link ext-link-type="uri" xlink:href="http://packages.ubuntu.com/precise/ncbi-blast+">http://packages.ubuntu.com/precise/ncbi-blast+</ext-link>
. Accessed Dec 2015.</mixed-citation>
</ref>
<ref id="CR79">
<label>79</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Brady</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mahurkar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>White</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Gevers</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Metaref: a pan-genomic database for comparative and community microbial genomics</article-title>
<source>Nucleic Acids Res.</source>
<year>2014</year>
<volume>42</volume>
<fpage>617</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt1078</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000339 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000339 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4673791
   |texte=   LAF: Logic Alignment Free and its application to bacterial genomes classification
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26664519" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021