MersV1, Pmc, Corpus, bibRecord, 000B31

An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method

Identifieur interne : 000B31 ( Pmc/Corpus ); précédent : 000B30; suivant : 000B32

An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method

Auteurs : Evandro A. Marucci ; Geraldo F. D. Zafalon ; Julio C. Momente ; Leandro A. Neves ; Carlo R. Valêncio ; Alex R. Pinto ; Adriano M. Cansian ; Rogeria C. G. De Souza ; Yang Shiyou ; José M. Machado

Source :

BioMed Research International [ 2314-6133 ] ; 2014.

RBID : PMC:4130029

Abstract

With the advance of genomic researches, the number of sequences involved in comparative methods has grown immensely. Among them, there are methods for similarities calculation, which are used by many bioinformatics applications. Due the huge amount of data, the union of low complexity methods with the use of parallel computing is becoming desirable. The k-mers counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarities calculation using the k-mers counting method is proposed. Tests show that the algorithm presents a very good scalability and a nearly linear speedup. For 14 nodes was obtained 12x speedup. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4130029

DOI: 10.1155/2014/563016
PubMed: 25140318
PubMed Central: 4130029

Links to Exploration step

PMC:4130029

Le document en format XML

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method</title>
<author><name sortKey="Marucci, Evandro A" sort="Marucci, Evandro A" uniqKey="Marucci E" first="Evandro A." last="Marucci">Evandro A. Marucci</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zafalon, Geraldo F D" sort="Zafalon, Geraldo F D" uniqKey="Zafalon G" first="Geraldo F. D." last="Zafalon">Geraldo F. D. Zafalon</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Momente, Julio C" sort="Momente, Julio C" uniqKey="Momente J" first="Julio C." last="Momente">Julio C. Momente</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Neves, Leandro A" sort="Neves, Leandro A" uniqKey="Neves L" first="Leandro A." last="Neves">Leandro A. Neves</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Valencio, Carlo R" sort="Valencio, Carlo R" uniqKey="Valencio C" first="Carlo R." last="Valêncio">Carlo R. Valêncio</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pinto, Alex R" sort="Pinto, Alex R" uniqKey="Pinto A" first="Alex R." last="Pinto">Alex R. Pinto</name>
<affiliation><nlm:aff id="I2">Department of Control Engineering and Automation, Federal University of Santa Catarina, Rua Pomerode 710, 89065-300 Blumenau, SC, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Cansian, Adriano M" sort="Cansian, Adriano M" uniqKey="Cansian A" first="Adriano M." last="Cansian">Adriano M. Cansian</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="De Souza, Rogeria C G" sort="De Souza, Rogeria C G" uniqKey="De Souza R" first="Rogeria C. G." last="De Souza">Rogeria C. G. De Souza</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Shiyou, Yang" sort="Shiyou, Yang" uniqKey="Shiyou Y" first="Yang" last="Shiyou">Yang Shiyou</name>
<affiliation><nlm:aff id="I3">College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Machado, Jose M" sort="Machado, Jose M" uniqKey="Machado J" first="José M." last="Machado">José M. Machado</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">25140318</idno>
<idno type="pmc">4130029</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4130029</idno>
<idno type="RBID">PMC:4130029</idno>
<idno type="doi">10.1155/2014/563016</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000B31</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B31</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method</title>
<author><name sortKey="Marucci, Evandro A" sort="Marucci, Evandro A" uniqKey="Marucci E" first="Evandro A." last="Marucci">Evandro A. Marucci</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zafalon, Geraldo F D" sort="Zafalon, Geraldo F D" uniqKey="Zafalon G" first="Geraldo F. D." last="Zafalon">Geraldo F. D. Zafalon</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Momente, Julio C" sort="Momente, Julio C" uniqKey="Momente J" first="Julio C." last="Momente">Julio C. Momente</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Neves, Leandro A" sort="Neves, Leandro A" uniqKey="Neves L" first="Leandro A." last="Neves">Leandro A. Neves</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Valencio, Carlo R" sort="Valencio, Carlo R" uniqKey="Valencio C" first="Carlo R." last="Valêncio">Carlo R. Valêncio</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Pinto, Alex R" sort="Pinto, Alex R" uniqKey="Pinto A" first="Alex R." last="Pinto">Alex R. Pinto</name>
<affiliation><nlm:aff id="I2">Department of Control Engineering and Automation, Federal University of Santa Catarina, Rua Pomerode 710, 89065-300 Blumenau, SC, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Cansian, Adriano M" sort="Cansian, Adriano M" uniqKey="Cansian A" first="Adriano M." last="Cansian">Adriano M. Cansian</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="De Souza, Rogeria C G" sort="De Souza, Rogeria C G" uniqKey="De Souza R" first="Rogeria C. G." last="De Souza">Rogeria C. G. De Souza</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Shiyou, Yang" sort="Shiyou, Yang" uniqKey="Shiyou Y" first="Yang" last="Shiyou">Yang Shiyou</name>
<affiliation><nlm:aff id="I3">College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Machado, Jose M" sort="Machado, Jose M" uniqKey="Machado J" first="José M." last="Machado">José M. Machado</name>
<affiliation><nlm:aff id="I1">Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BioMed Research International</title>
<idno type="ISSN">2314-6133</idno>
<idno type="eISSN">2314-6141</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>With the advance of genomic researches, the number of sequences involved in comparative methods has grown immensely. Among them, there are methods for similarities calculation, which are used by many bioinformatics applications. Due the huge amount of data, the union of low complexity methods with the use of parallel computing is becoming desirable. The <italic>k-mers</italic>
 counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarities calculation using the <italic>k-mers</italic>
 counting method is proposed. Tests show that the algorithm presents a very good scalability and a nearly linear speedup. For 14 nodes was obtained 12x speedup. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zafalon, Gf" uniqKey="Zafalon G">GF Zafalon</name>
</author>
<author><name sortKey="Marucci, Ea" uniqKey="Marucci E">EA Marucci</name>
</author>
<author><name sortKey="Momente, Jc" uniqKey="Momente J">JC Momente</name>
</author>
<author><name sortKey="Amazonas, Jr" uniqKey="Amazonas J">JR Amazonas</name>
</author>
<author><name sortKey="Sato, Lm" uniqKey="Sato L">LM Sato</name>
</author>
<author><name sortKey="Machado, Jm" uniqKey="Machado J">JM Machado</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Deng, X" uniqKey="Deng X">X Deng</name>
</author>
<author><name sortKey="Li, E" uniqKey="Li E">E Li</name>
</author>
<author><name sortKey="Shan, J" uniqKey="Shan J">J Shan</name>
</author>
<author><name sortKey="Chen, W" uniqKey="Chen W">W Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Arcuri, Ha" uniqKey="Arcuri H">HA Arcuri</name>
</author>
<author><name sortKey="Zafalon, Gfd" uniqKey="Zafalon G">GFD Zafalon</name>
</author>
<author><name sortKey="Marucci, Ea" uniqKey="Marucci E">EA Marucci</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marucci, Ea" uniqKey="Marucci E">EA Marucci</name>
</author>
<author><name sortKey="Zafalon, Gf" uniqKey="Zafalon G">GF Zafalon</name>
</author>
<author><name sortKey="Momente, Jc" uniqKey="Momente J">JC Momente</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Katoh, K" uniqKey="Katoh K">K Katoh</name>
</author>
<author><name sortKey="Misawa, K" uniqKey="Misawa K">K Misawa</name>
</author>
<author><name sortKey="Kuma, K" uniqKey="Kuma K">K Kuma</name>
</author>
<author><name sortKey="Miyata, T" uniqKey="Miyata T">T Miyata</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Katoh, K" uniqKey="Katoh K">K Katoh</name>
</author>
<author><name sortKey="Toh, H" uniqKey="Toh H">H Toh</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
<author><name sortKey="Batzoglou, S" uniqKey="Batzoglou S">S Batzoglou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Quinn, Mj" uniqKey="Quinn M">MJ Quinn</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Amdahl, Gm" uniqKey="Amdahl G">GM Amdahl</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Biomed Res Int</journal-id>
<journal-id journal-id-type="iso-abbrev">Biomed Res Int</journal-id>
<journal-id journal-id-type="publisher-id">BMRI</journal-id>
<journal-title-group><journal-title>BioMed Research International</journal-title>
</journal-title-group>
<issn pub-type="ppub">2314-6133</issn>
<issn pub-type="epub">2314-6141</issn>
<publisher><publisher-name>Hindawi Publishing Corporation</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">25140318</article-id>
<article-id pub-id-type="pmc">4130029</article-id>
<article-id pub-id-type="doi">10.1155/2014/563016</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Marucci</surname>
<given-names>Evandro A.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-2384-011X</contrib-id>
<name><surname>Zafalon</surname>
<given-names>Geraldo F. D.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<contrib contrib-type="author"><contrib-id contrib-id-type="orcid" authenticated="false">http://orcid.org/0000-0003-3355-0343</contrib-id>
<name><surname>Momente</surname>
<given-names>Julio C.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Neves</surname>
<given-names>Leandro A.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Valêncio</surname>
<given-names>Carlo R.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Pinto</surname>
<given-names>Alex R.</given-names>
</name>
<xref ref-type="aff" rid="I2"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Cansian</surname>
<given-names>Adriano M.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>de Souza</surname>
<given-names>Rogeria C. G.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Shiyou</surname>
<given-names>Yang</given-names>
</name>
<xref ref-type="aff" rid="I3"><sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Machado</surname>
<given-names>José M.</given-names>
</name>
<xref ref-type="aff" rid="I1"><sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="I1"><sup>1</sup>
Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristóvão Colombo 2265, 15054-000 São José do Rio Preto, SP, Brazil</aff>
<aff id="I2"><sup>2</sup>
Department of Control Engineering and Automation, Federal University of Santa Catarina, Rua Pomerode 710, 89065-300 Blumenau, SC, Brazil</aff>
<aff id="I3"><sup>3</sup>
College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China</aff>
<author-notes><corresp id="cor1">*Geraldo F. D. Zafalon: <email>zafalon@sjrp.unesp.br</email>
</corresp>
<fn fn-type="other"><p>Academic Editor: Tzong-Yi Lee</p>
</fn>
</author-notes>
<pub-date pub-type="ppub"><year>2014</year>
</pub-date>
<pub-date pub-type="epub"><day>22</day>
<month>7</month>
<year>2014</year>
</pub-date>
<volume>2014</volume>
<elocation-id>563016</elocation-id>
<history><date date-type="received"><day>12</day>
<month>2</month>
<year>2014</year>
</date>
<date date-type="accepted"><day>2</day>
<month>7</month>
<year>2014</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2014 Evandro A. Marucci et al.</copyright-statement>
<copyright-year>2014</copyright-year>
<license xlink:href="https://creativecommons.org/licenses/by/3.0/"><license-p>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract><p>With the advance of genomic researches, the number of sequences involved in comparative methods has grown immensely. Among them, there are methods for similarities calculation, which are used by many bioinformatics applications. Due the huge amount of data, the union of low complexity methods with the use of parallel computing is becoming desirable. The <italic>k-mers</italic>
 counting is a very efficient method with good biological results. In this work, the development of a parallel algorithm for multiple sequence similarities calculation using the <italic>k-mers</italic>
 counting method is proposed. Tests show that the algorithm presents a very good scalability and a nearly linear speedup. For 14 nodes was obtained 12x speedup. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE.</p>
</abstract>
</article-meta>
</front>
<body><sec id="sec1"><title>1. Introduction</title>
<p>The use of sequence comparison methods has been remarkably growing in recent years, in response to the data expansion of genomic research. Consequently, methods that reduce the execution time are fundamental to the progress of this area. Many efforts have been made concerning their optimization [<xref rid="B1" ref-type="bibr">1</xref>
, <xref rid="B2" ref-type="bibr">2</xref>
].</p>
<p>With the increase in number of multiple sequences alignments problems, the development of methods which have a lower computational complexity also increases. Nevertheless, just the creation of low complexity methods was not enough to work with high volume of data [<xref rid="B3" ref-type="bibr">3</xref>
, <xref rid="B4" ref-type="bibr">4</xref>
]. The interest in parallel computing has grown and, hence, several parallel methods were created and embedded in many sequence comparison tools [<xref rid="B5" ref-type="bibr">5</xref>
].</p>
<p>In general, the sequence comparison starts with the similarity calculation between pairs of sequences. It occurs through methods which require or not a prior alignment, varying the algorithm complexity and the level of biological accuracy. The alignment free methods correspond to a class of low complexity methods, which have a low temporal and spatial complexity in relation to the methods which requires a prior alignment. Moreover, they also keep a high level of biological accuracy for divergent sequences. For this reason, they are becoming increasingly important in computational biology.</p>
<p>The similarity calculation methods shall be classified into two main categories: methods based on the counting of words and methods that do not involve such a counting. Among the methods based on the counting of words there is the<italic> k-mers</italic>
 counting method, which was proposed by Katoh et al. [<xref rid="B6" ref-type="bibr">6</xref>
]. This method counts the number of<italic> k-mers</italic>
 (words of size<italic> k</italic>
) shared by a pair of sequences, using it as an approximation of the similarity level. It also uses a different alphabet, built on the basis of statistical data, maximizing the biological accuracy in relation to the previously proposed methods.</p>
<p>The<italic> k-mers</italic>
 counting method was implemented in some important multiple sequences alignment tools, such as MAFFT [<xref rid="B6" ref-type="bibr">6</xref>
, <xref rid="B7" ref-type="bibr">7</xref>
] and MUSCLE [<xref rid="B1" ref-type="bibr">1</xref>
], and to the extent of our knowledge there is no other work in the literature that has developed and tested a parallel approach, specifically of this method. Once they are important tools, giving excellent results, in both biological accuracy and computational complexity [<xref rid="B8" ref-type="bibr">8</xref>
], the development of a parallel algorithm for multiple sequence similarities calculation using the<italic> k-mers</italic>
 counting method is very useful. Although a parallel MUSCLE exists for shared memory systems [<xref rid="B3" ref-type="bibr">3</xref>
], it does not include the parallelization of this stage. The parallel algorithm can be used in the development of any parallel tool that requires it in one of your stages. Tree construction or progressive multiple alignment tools, in general, are some examples.</p>
<p>This paper presents the<italic> k-mers</italic>
 counting method with the proposed similarity calculation parallel algorithm for multiple sequences through this method. The algorithm was developed for distributed memory parallel systems, using the library MPI [<xref rid="B9" ref-type="bibr">9</xref>
].</p>
</sec>
<sec id="sec2"><title>2. Materials and Methods</title>
<sec id="sec2.1"><title>2.1. Word Counting Methods</title>
<p>In general, word counting methods start with the mapping from sequences to vectors which store the length of each word. These words are subsequences of length <italic>k</italic>
, also known as a <italic>k</italic>
-tuple.</p>
<p>In order to understand the behavior of these methods, it is interesting to perform a review about some words statistical concepts. Then, consider a sequence <italic>X</italic>
, of length <italic>n</italic>
, defined as a segment of <italic>n</italic>
 symbols of a finite alphabet <italic>A</italic>
, of length <italic>r</italic>
.</p>
<p>A segment of <italic>k</italic>
 symbols, with <italic>k</italic>
 ≤ <italic>n</italic>
, is a <italic>k</italic>
-tuple. The set <italic>W</italic>
<sub><italic>k</italic>
</sub>
 consists of all possible <italic>k</italic>
-tuples from the alphabet set <italic>A</italic>
 and has <italic>N</italic> elements:
<disp-formula id="eq1"><label>(1)</label>
<mml:math id="M1"><mml:mtable><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>W</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>{</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:mi>N</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup><mml:mrow><mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Then, we count the number of <italic>k</italic>
-tuples of <italic>W</italic>
<sub><italic>k</italic>
</sub>
 which appear in the sequence <italic>X</italic>
. Computationally, this count is normally made moving a window of size <italic>k</italic>
 through the sequence, from the position 1 until the position <italic>n</italic>
 − <italic>k</italic>
 + 1. The vector <italic>c</italic>
<sub><italic>k</italic>
</sub>
<sup><italic>X</italic>
</sup>
 is responsible for the storage of the number of occurrences of <italic>k</italic>
-tuples in the sequence <italic>X</italic>:
<disp-formula id="eq2"><label>(2)</label>
<mml:math id="M2"><mml:mtable><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="normal">2</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>A frequency vector <italic>f</italic>
<sub><italic>k</italic>
</sub>
<sup><italic>X</italic>
</sup>
 can then be gotten from the relative quantity of each <italic>k</italic>-tuple:
<disp-formula id="eq3"><label>(3)</label>
<mml:math id="M3"><mml:mtable><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow><mml:mrow><mml:msubsup><mml:mo stretchy="false">∑</mml:mo>
<mml:mrow><mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfrac>
<mml:mo>≡</mml:mo>
<mml:msubsup><mml:mrow><mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow><mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>As an example of the use of these structures, imagine a DNA sequence, where <italic>A</italic>
 = {<italic>A</italic>
, <italic>T</italic>
, <italic>C</italic>
, <italic>G</italic>
} and <italic>r</italic>
 = 4. For <italic>k</italic>
 = 3, <italic>ATC</italic>
 and <italic>AAA</italic>
 are <italic>k</italic>
-tuple belonging to the set <italic>W</italic>
<sub>3</sub>
. For the sequence <italic>X</italic>
 = <italic>ATATAC</italic>
, where <italic>n</italic>
 = 6, the counting and frequency vectors (<italic>c</italic>
<sub>3</sub>
<sup><italic>X</italic>
</sup>
 and <italic>f</italic>
<sub>3</sub>
<sup><italic>X</italic>
</sup>
, resp.) are constructed as <italic>k</italic>
-tuples of all <italic>W</italic>
<sub>3</sub>
 which are identified in the sequence <italic>X</italic>
. The sequence <italic>X</italic>
 is travelled in a window of size <italic>k</italic>
 = 3. The word within each window is compared with the words of <italic>W</italic>
<sub>3</sub>
. In this case, <italic>n</italic>
 − <italic>k</italic>
 + 1 = 4 comparisons are necessary (<italic>ATA</italic>
, <italic>TAT</italic>
, <italic>ATA</italic>
, <italic>TAC</italic>):
<disp-formula id="eq4"><label>(4)</label>
<mml:math id="M4"><mml:mtable><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>W</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn mathvariant="normal">3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>{</mml:mo>
<mml:mrow><mml:mi>A</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn mathvariant="normal">3</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mn mathvariant="normal">2,1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="normal">1,0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn mathvariant="normal">3</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mn mathvariant="normal">0.5,0.25,0.25,0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
</sec>
<sec id="sec2.2"><title>2.2. The<italic> k-mers</italic>
 Counting Method</title>
<p>In the<italic> k-mers</italic>
 counting method, we use the term<italic> k-mer</italic>
 to represent the words, or <italic>k</italic>
-tuples. This method presents a considerably greater speed in relation to conventional methods, which require alignment [<xref rid="B10" ref-type="bibr">10</xref>
]. Its algorithm, implemented to determine the number of<italic> k-mers</italic>
 shared by two sequences, is <italic>O</italic>
(<italic>n</italic>
), for sequences of size <italic>n</italic>
. Differently, the conventional algorithms, which require alignment, are <italic>O</italic>
(<italic>n</italic>
<sup>2</sup>
).</p>
<p>This algorithm uses, in general, a little different alphabet. In most cases, the alphabet used is a variation of the default alphabet. Known by compressed alphabets, these alphabets contain symbols that denote classes that correspond to two or more different types of residues (each residue is represented by a letter).</p>
<p>For amino acids sequences, a compressed alphabet <italic>C</italic>
 of size <italic>N</italic>
 is a partition of the default amino acids alphabet <italic>A</italic>
, which contains 20 letters, in <italic>N</italic>
 disjoined classes containing similar amino acids. <xref ref-type="table" rid="tab1"> Table 1</xref>
, extracted from [<xref rid="B10" ref-type="bibr">10</xref>
], shows some examples of compressed alphabets.</p>
<p>With the use of compressed alphabets the identity is highly conserved. Pairs of related sequences have always a greater or equal identity and, therefore, more<italic> k-mers</italic>
 in common in an alphabet smaller than the default alphabet. An example of this characteristic can be seen in <xref ref-type="table" rid="tab2">Table 2</xref>
.</p>
<p>In <xref ref-type="table" rid="tab2">Table 2</xref>
, the upper and lower alignment are the same. The difference is that the first uses <italic>A</italic>
, as default amino acids alphabet, while the second uses the compressed alphabet SE-V(10), whose members of the classes are represented by their first letters in alphabetical order—<italic>I</italic>
, <italic>L</italic>
, <italic>M</italic>
, and <italic>V</italic>
 are shown as <italic>I</italic>
, for example. The columns which are fully conserved are indicated with capital letters and with an asterisk below. The number of conserved columns (<italic>k</italic>
 = 1) increased from 12 in <italic>A</italic>
 to 19 in SE-V(10). For <italic>k</italic>
 = 3, the number of fully conserved<italic> k-mers</italic>
 increased from 4 in <italic>A</italic>
 to 10 in SE-V(10) and, for <italic>k</italic>
 = 4, from 3 to 8.</p>
<p>The choice of the alphabet and the value of <italic>k</italic>
 is based on statistics and has strong impact on the number of conserved identities. If the alphabet is selected, which means that there is a high probability of residues replacements of the same class and a low probability of residues replacements from distinct classes, then we probably have an increase in the number of identities detected. Moreover, the value of <italic>k</italic>
 confines this increasing in regions of continuous identity. As the sequences differ, the number of conserved<italic> k-mers</italic>
 is reduced, reaching a limit compared to the expected number of no related sequences. The use of compressed alphabets increases the likelihood of this limit be reached at a greater evolutionary distance. Subtle choices of the size of this alphabet and the value of <italic>k</italic>
 can provide a better measure of similarity.</p>
<p>The following equation shows how we calculate the similarity between sequences <italic>X</italic>
 and <italic>Y</italic>
 by the<italic> k-mers</italic> counting method:
<disp-formula id="eq5"><label>(5)</label>
<mml:math id="M5"><mml:mtable><mml:mtr><mml:mtd><mml:mi>F</mml:mi>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mo stretchy="false">∑</mml:mo>
<mml:mrow><mml:mi>τ</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow><mml:mi>min</mml:mi>
<mml:mo>⁡</mml:mo>
<mml:mrow><mml:mo>[</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mi>τ</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>Y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:mi>τ</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:mrow><mml:mo>[</mml:mo>
<mml:mrow><mml:mi>min</mml:mi>
<mml:mo>⁡</mml:mo>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:msub><mml:mrow><mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>X</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub><mml:mrow><mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>Y</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn mathvariant="normal">1</mml:mn>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Here <italic>τ</italic>
 is a<italic> k-mer</italic>
, <italic>L</italic>
<sub><italic>X</italic>
</sub>
 and <italic>L</italic>
<sub><italic>Y</italic>
</sub>
 are the sequences lengths, and <italic>n</italic>
<sub><italic>X</italic>
</sub>
(<italic>τ</italic>
) and <italic>n</italic>
<sub><italic>Y</italic>
</sub>
(<italic>τ</italic>
) are the number of times <italic>τ</italic>
 appears in <italic>X</italic>
 and <italic>Y</italic>
, respectively.</p>
</sec>
<sec id="sec2.3"><title>2.3. Multiple Sequence Similarities Calculation</title>
<p>In any application that performs a comparison between multiple sequences, either by multiple alignment or just by the construction of phylogenetic trees, we perform the similarity calculation in many independent sequences. Thus, the <italic>F</italic>
 value is obtained for all pairs of sequences involved in the processing. As <italic>F</italic>
(<italic>X</italic>
, <italic>Y</italic>
) equals <italic>F</italic>
(<italic>Y</italic>
, <italic>X</italic>
), this value is calculated once, by two nested loops, as</p>
<p><monospace>for (i = 1; i < num_seq; ++i) </monospace>
</p>
<p><monospace>for (j = 0; j < i; ++i) </monospace>
</p>
<p><monospace>M[i,j] = F(i,j) </monospace>
</p>
<p>All obtained values of <italic>F</italic>
 are, therefore, stored in a triangular matrix <italic>M</italic>
.</p>
</sec>
<sec id="sec2.4"><title>2.4. Parallel Algorithm</title>
<p>In an application, the amount of sequences in comparison can be huge, making its implementation unfeasible in a single machine, even with the use of low computational complexity methods, as the<italic> k-mers</italic>
 counting method. For this reason, we propose a parallel algorithm that calculates the similarities of multiple pairs of sequences using the<italic> k-mers</italic>
 counting method.</p>
<p>This algorithm dynamically divides the computation among existing processors through a master-slave approach. This parallelism is performed distributing the similarity calculation of sequence pairs to available slaves. It is possible because each similarity pair calculation is independent of other calculations of similarity pair.</p>
<p>Initially, the master sends all sequences by broadcasting to all slaves. Using broadcast the master can reduce the overhead with the messages exchange, decreasing the communication time between processors.</p>
<p>The tasks distribution is based on the processor identifier, assigning to each slave the calculation of specific lines of the triangular matrix of similarities. The way this matrix is obtained is exemplified in <xref ref-type="fig" rid="fig1">Figure 1</xref>
. Each slave initially calculates the corresponding line of the identifier, in a <italic>p</italic>
-step loop, where <italic>p</italic>
 is the number of processors. In <xref ref-type="fig" rid="fig1">Figure 1</xref>
, we have seven sequences and, hence, seven lines in the matrix. The first slave is responsible for the calculation of the lines 1, 4, and 7, the second slave by lines 2 and 5 and the third slave by lines 3 and 6. The gray values are the obtained results in other slaves which will only be joined in the master for the creation of the similarities matrix.</p>
<p>During the matrix lines calculation, each slave stores the results of all lines in a single vector (<monospace>SimVect</monospace>
) which is sent to master. The master receives all slave vectors and, from these vectors and the identifier the slave has sent, it builds the similarities matrix. From the previous example, the obtained similarities matrix is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>
.</p>
<p>In <xref ref-type="fig" rid="fig3">Figure 3</xref>
 is showed the flowchart of the algorithm which performs the task in Figures <xref ref-type="fig" rid="fig1">1</xref>
 and <xref ref-type="fig" rid="fig2">2</xref>
.</p>
</sec>
</sec>
<sec id="sec3"><title>3. Results</title>
<sec id="sec3.1"><title>3.1. Performance Evaluation</title>
<p>Many tests were performed to measure the performance of the proposed algorithms. The datasets used in these experiments have differences only on the number of sequences to be aligned and their lengths. The sequences used were extracted from the NCBI database (<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov">http://www.ncbi.nlm.nih.gov/</ext-link>
). For each performed test, we describe specific information of the dataset used, as the number of sequences and the average and maximum length of the sequences.</p>
<p>Tests were performed on a Beowulf cluster running under Linux Debian. The Beowulf cluster consists of 15 nodes, each one composed by one AMD Athlon XP 2100+ processor with 1 GB of RAM memory. The nodes are connected with a dedicated 10/100 Fast Ethernet switch. The tests were performed with 1, 2, 4, 8, and 15 nodes. The run times were measured by executing them in stand alone mode, to ensure exclusive use of the communication and processors CPU and memory.</p>
<p>In order to verify the proposed algorithm performance, we executed tests with four different datasets. For each dataset, we verified the algorithm scalability when executed in a crescent machine number.</p>
<p>The first three graphics ((a), (b) and (c)) illustrated in <xref ref-type="fig" rid="fig4">Figure 4</xref>
 show an almost linear speedup for tests with more than 500 sequences. Notice that the minimum system set for the parallel algorithm execution consists of two nodes, because such algorithm has a master-slave model. One machine (the master) is responsible for data management. The performance difference between the sequential algorithm and the parallel one, when run on a minimum system, is almost zero. For this reason, we only showed the tests performed with the parallel algorithm from the minimum system of two nodes until the maximum number of nodes in the cluster.</p>
<p>The graphic (d), also in <xref ref-type="fig" rid="fig4">Figure 4</xref>
, illustrates also a constant performance, regardless of the number of machines, for an entry with few sequences. In this case, the sequential implementation of the algorithm is so fast for that input (approximately 1 second) that the speedup achieved with tasks division is close to the time spent with messages exchanges.</p>
<p>Comparing the minimum system (2 nodes) with the maximum system (15 nodes) we have an increase of 14 nodes and, approximately, a 12x speedup. Therefore, it can be noticed that the parallel algorithm presents excellent results with a nearly linear speedup.</p>
</sec>
</sec>
<sec id="sec4"><title>4. Discussion</title>
<p>In this work, we presented a parallel strategy to calculate similarities between multiple pairs of sequences using the<italic> k-mers</italic>
 couting method, a low computational complexity method with good biological results. This calculation is used, for example, in multiple sequence alignment tools. The proposed parallel algorithm has been implemented for distributed memory systems, due to the wide use of Beowulf clusters in genomic research laboratories.</p>
<p>The tests performed show that the algorithm presents a good scalability and a nearly linear speedup. With the use of 14 processing nodes (slaves), the system achieved a 12x speedup. This speedup is justified by the total independence of the scheduled tasks and by the good load balancing obtained with the triangular matrix lines distribution.</p>
<p>Additionally, the communication cost is minimized because the computed data in slaves are sent to master in a single message. Considering the 12x speedup achieved and using Amdahl's law [<xref rid="B11" ref-type="bibr">11</xref>
] we can estimate that about 1.5% of the processes in the system are unparallelized, which reinforces the need of obtained optimization with the concentrated communication, avoiding message overload.</p>
</sec>
</body>
<back><ack><title>Acknowledgments</title>
<p>The authors would like to thank all of their collaborators and institutions for the support to the development of the present work. This work was partially supported by the São Paulo Research Foundation (FAPESP, Brazil) under Grant no. 06/59592-0.</p>
</ack>
<sec sec-type="conflict"><title>Conflict of Interests</title>
<p>The authors declare that there is no conflict of interests regarding the publication of this paper.</p>
</sec>
<ref-list><ref id="B1"><label>1</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>MUSCLE: a multiple sequence alignment method with reduced time and space complexity</article-title>
<source><italic>BMC Bioinformatics</italic>
</source>
<year>2004</year>
<volume>5, article 113</volume>
<pub-id pub-id-type="other">2-s2.0-13244255415</pub-id>
</element-citation>
</ref>
<ref id="B2"><label>2</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zafalon</surname>
<given-names>GF</given-names>
</name>
<name><surname>Marucci</surname>
<given-names>EA</given-names>
</name>
<name><surname>Momente</surname>
<given-names>JC</given-names>
</name>
<name><surname>Amazonas</surname>
<given-names>JR</given-names>
</name>
<name><surname>Sato</surname>
<given-names>LM</given-names>
</name>
<name><surname>Machado</surname>
<given-names>JM</given-names>
</name>
</person-group>
<article-title>Improvements in the score matrix calculation method using parallel score estimating algorithm</article-title>
<source><italic>Journal of Biophysical Chemistry</italic>
</source>
<year>2013</year>
<volume>4</volume>
<issue>2</issue>
<fpage>47</fpage>
<lpage>51</lpage>
</element-citation>
</ref>
<ref id="B3"><label>3</label>
<element-citation publication-type="confproc"><person-group person-group-type="author"><name><surname>Deng</surname>
<given-names>X</given-names>
</name>
<name><surname>Li</surname>
<given-names>E</given-names>
</name>
<name><surname>Shan</surname>
<given-names>J</given-names>
</name>
<name><surname>Chen</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Parallel implementation and performance characterization of muscle</article-title>
<conf-name>Proceedings of the 20th IEEE Parallel and Distributed Processing Symposium (IPDPS '06)</conf-name>
<conf-date>2006</conf-date>
</element-citation>
</ref>
<ref id="B4"><label>4</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Arcuri</surname>
<given-names>HA</given-names>
</name>
<name><surname>Zafalon</surname>
<given-names>GFD</given-names>
</name>
<name><surname>Marucci</surname>
<given-names>EA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>SKPDB: a structural database of shikimate pathway enzymes</article-title>
<source><italic>BMC Bioinformatics</italic>
</source>
<year>2010</year>
<volume>11, article 12</volume>
<fpage>1</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="other">2-s2.0-77349112482</pub-id>
<pub-id pub-id-type="pmid">20043860</pub-id>
</element-citation>
</ref>
<ref id="B5"><label>5</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Marucci</surname>
<given-names>EA</given-names>
</name>
<name><surname>Zafalon</surname>
<given-names>GF</given-names>
</name>
<name><surname>Momente</surname>
<given-names>JC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Using threads to overcome synchronization delays in parallel multiple progressive alignment algorithms</article-title>
<source><italic>American Journal of Bioinformatics</italic>
</source>
<year>2012</year>
<volume>1</volume>
<issue>1</issue>
<fpage>50</fpage>
<lpage>63</lpage>
</element-citation>
</ref>
<ref id="B6"><label>6</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Katoh</surname>
<given-names>K</given-names>
</name>
<name><surname>Misawa</surname>
<given-names>K</given-names>
</name>
<name><surname>Kuma</surname>
<given-names>K</given-names>
</name>
<name><surname>Miyata</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform</article-title>
<source><italic>Nucleic Acids Research</italic>
</source>
<year>2002</year>
<volume>30</volume>
<issue>14</issue>
<fpage>3059</fpage>
<lpage>3066</lpage>
<pub-id pub-id-type="other">2-s2.0-0037100671</pub-id>
<pub-id pub-id-type="pmid">12136088</pub-id>
</element-citation>
</ref>
<ref id="B7"><label>7</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Katoh</surname>
<given-names>K</given-names>
</name>
<name><surname>Toh</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Parallelization of the MAFFT multiple sequence alignment program</article-title>
<source><italic>Bioinformatics</italic>
</source>
<year>2010</year>
<volume>26</volume>
<issue>15</issue>
<fpage>1899</fpage>
<lpage>1900</lpage>
<pub-id pub-id-type="publisher-id">btq224</pub-id>
<pub-id pub-id-type="other">2-s2.0-77955019084</pub-id>
<pub-id pub-id-type="pmid">20427515</pub-id>
</element-citation>
</ref>
<ref id="B8"><label>8</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
<name><surname>Batzoglou</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Multiple sequence alignment</article-title>
<source><italic>Current Opinion in Structural Biology</italic>
</source>
<year>2006</year>
<volume>16</volume>
<issue>3</issue>
<fpage>368</fpage>
<lpage>373</lpage>
<pub-id pub-id-type="other">2-s2.0-33744811392</pub-id>
<pub-id pub-id-type="pmid">16679011</pub-id>
</element-citation>
</ref>
<ref id="B9"><label>9</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Quinn</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<source><italic>Parallel Programming in C with MPI and OpenMP</italic>
</source>
<year>2003</year>
<edition>1st edition</edition>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>McGraw-Hill Education</publisher-name>
</element-citation>
</ref>
<ref id="B10"><label>10</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Local homology recognition and distance measures in linear time using compressed amino acid alphabets</article-title>
<source><italic>Nucleic Acids Research</italic>
</source>
<year>2004</year>
<volume>32</volume>
<issue>1</issue>
<fpage>380</fpage>
<lpage>385</lpage>
<pub-id pub-id-type="other">2-s2.0-1242320272</pub-id>
<pub-id pub-id-type="pmid">14729922</pub-id>
</element-citation>
</ref>
<ref id="B11"><label>11</label>
<element-citation publication-type="confproc"><person-group person-group-type="author"><name><surname>Amdahl</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>Validity of the single processor approach to achieving large scale computing capabilities</article-title>
<conf-name>Proceedings of the Spring Joint Computer Conference (AFIPS '67)</conf-name>
<conf-date>1967</conf-date>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group><fig id="fig1" orientation="portrait" position="float"><label>Figure 1</label>
<caption><p>Example of how the similarities matrix calculation is distributed between the slaves.</p>
</caption>
<graphic xlink:href="BMRI2014-563016.001"></graphic>
</fig>
<fig id="fig2" orientation="portrait" position="float"><label>Figure 2</label>
<caption><p>Similarities matrix that is built by the master processor.</p>
</caption>
<graphic xlink:href="BMRI2014-563016.002"></graphic>
</fig>
<fig id="fig3" orientation="portrait" position="float"><label>Figure 3</label>
<caption><p>Flowchart of the parallel algorithm for multiple sequence similarities calculation.</p>
</caption>
<graphic xlink:href="BMRI2014-563016.003"></graphic>
</fig>
<fig id="fig4" orientation="portrait" position="float"><label>Figure 4</label>
<caption><p>Processing time, in seconds, of the parallel algorithm for 2, 4, 8, and 15 nodes executing with four different datasets.</p>
</caption>
<graphic xlink:href="BMRI2014-563016.004"></graphic>
</fig>
<table-wrap id="tab1" orientation="portrait" position="float"><label>Table 1</label>
<caption><p>Examples of compressed alphabets.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left" rowspan="1" colspan="1">Alphabet (<italic>N</italic>
) </th>
<th align="left" rowspan="1" colspan="1">Classes </th>
</tr>
</thead>
<tbody><tr><td align="left" rowspan="1" colspan="1">SE-B (14) </td>
<td align="left" rowspan="1" colspan="1">A, C, D, EQ, FY, G, H, IV, KR, LM, N, P, ST, W </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">SE-B (10) </td>
<td align="left" rowspan="1" colspan="1">AST, C, DN, EQ, FY, G, HW, ILMV, KR, P </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">SE-V (10) </td>
<td align="left" rowspan="1" colspan="1">AST, C, DEN, FY, G, H, ILMV, KQR, P, W </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Li-A (10) </td>
<td align="left" rowspan="1" colspan="1">AC, DE, FWY, G, HN, IV, KQR, LM, P, ST </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Li-B (10) </td>
<td align="left" rowspan="1" colspan="1">AST, C, DEQ, FWY, G, HN, IV, KR, LM, P </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Solis-D (10) </td>
<td align="left" rowspan="1" colspan="1">AM, C, DNS, EKQR, F, GP, HT, IV, LY, W </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Solis-G (10) </td>
<td align="left" rowspan="1" colspan="1">AEFIKLMQRVW, C, D, G, H, N, P, S, T, Y </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Murphy (10) </td>
<td align="left" rowspan="1" colspan="1">A, C, DENQ, FWY, G, H, ILMV, KR, P, ST </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">SE-B (8) </td>
<td align="left" rowspan="1" colspan="1">AST, C, DHN, EKQR, FWY, G, ILMV, P </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">SE-B (6) </td>
<td align="left" rowspan="1" colspan="1">AST, CP, DEHKNQR, FWY, G, ILMV </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Dayhoff (6) </td>
<td align="left" rowspan="1" colspan="1">AGPST, C, DENQ, FWY, HKR, ILMV </td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab2" orientation="portrait" position="float"><label>Table 2</label>
<caption><p>Comparison between the use of the alphabet <italic>A</italic>
 and the compressed alphabet SE-V (10).</p>
</caption>
<table frame="hsides" rules="groups"><tbody><tr><td align="left" rowspan="1" colspan="1">Seq1: sAaNiLvGEnlvcKvaDFGLARl </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Seq2: aArNiLvGEnyicKvaDFGLARl </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Seq3: <inline-formula><mml:math id="M6"><mml:munder><mml:mrow><mml:mtext>a</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>A</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>r</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>N</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>v</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>L</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>i</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>G</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>E</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>d</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>n</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>v</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>a</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>K</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>i</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>c</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>D</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>F</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>G</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>L</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>A</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>R</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>v</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
</mml:math>
</inline-formula>
</td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Using the default amino acids alphabet </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="center" colspan="2" rowspan="1"><hr></hr>
</td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Seq1: AAaNILIGENlIcKIaDFGLARI </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Seq2: AArNILIGENyIcKIaDFGLARI </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Seq3: <inline-formula><mml:math id="M7"><mml:munder><mml:mrow><mml:mtext>A</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>A</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>r</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>N</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>I</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>L</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>I</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>G</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>E</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>N</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>n</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>I</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>a</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>K</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>I</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>c</mml:mtext>
</mml:mrow>
<mml:mo> </mml:mo>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>D</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>F</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>G</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>L</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>A</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>R</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
<mml:munder><mml:mrow><mml:mtext>I</mml:mtext>
</mml:mrow>
<mml:mi>∗</mml:mi>
</mml:munder>
</mml:math>
</inline-formula>
</td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1">Using the compressed alphabet SE-V (10) </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
<tr><td align="left" rowspan="1" colspan="1"> (each class member is represented by the first letter in alphabetical order) </td>
<td align="center" rowspan="1" colspan="1"> </td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B31 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000B31 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4130029
   |texte=   An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25140318" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	Serveur d'exploration MERS
	Attention, ce site est en cours de développement ! Attention, site généré par des moyens informatiques à partir de corpus bruts. Les informations ne sont donc pas validées.

Serveur d'exploration MERS

An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method

An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method

Source :

Abstract

Links to Exploration step

Le document en format XML

Pour manipuler ce document sous Unix (Dilib)

Pour mettre un lien sur cette page dans le réseau Wicri

Pour générer des pages wiki