<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</title>
<author><name sortKey="Karimi, Ramin" sort="Karimi, Ramin" uniqKey="Karimi R" first="Ramin" last="Karimi">Ramin Karimi</name>
<affiliation><nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hajdu, Andras" sort="Hajdu, Andras" uniqKey="Hajdu A" first="Andras" last="Hajdu">Andras Hajdu</name>
<affiliation><nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="af2-ebo-12-2016-073">Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26884678</idno>
<idno type="pmc">4750899</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4750899</idno>
<idno type="RBID">PMC:4750899</idno>
<idno type="doi">10.4137/EBO.S35545</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000090</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000090</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</title>
<author><name sortKey="Karimi, Ramin" sort="Karimi, Ramin" uniqKey="Karimi R" first="Ramin" last="Karimi">Ramin Karimi</name>
<affiliation><nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Hajdu, Andras" sort="Hajdu, Andras" uniqKey="Hajdu A" first="Andras" last="Hajdu">Andras Hajdu</name>
<affiliation><nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="af2-ebo-12-2016-073">Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Evolutionary Bioinformatics Online</title>
<idno type="eISSN">1176-9343</idno>
<imprint><date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with this effort, and driven by a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is much needed. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding <italic>k</italic>-mer generator, GkmerG (genome <italic>k</italic>-mers generator). Using this pipeline, we determine the frequency of <italic>k</italic>-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. Hadoop and MapReduce are used in this pipeline as parallel and distributed computing tools running on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genome databases. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Kaderali, L" uniqKey="Kaderali L">L Kaderali</name>
</author>
<author><name sortKey="Schliep, A" uniqKey="Schliep A">A Schliep</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Francois, P" uniqKey="Francois P">P Francois</name>
</author>
<author><name sortKey="Charbonnier, Y" uniqKey="Charbonnier Y">Y Charbonnier</name>
</author>
<author><name sortKey="Jacquet, J" uniqKey="Jacquet J">J Jacquet</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, F" uniqKey="Li F">F Li</name>
</author>
<author><name sortKey="Stormo, Gd" uniqKey="Stormo G">GD Stormo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Coenye, T" uniqKey="Coenye T">T Coenye</name>
</author>
<author><name sortKey="Vandamme, P" uniqKey="Vandamme P">P Vandamme</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="V Trovsk, T" uniqKey="V Trovsk T">T Větrovský</name>
</author>
<author><name sortKey="Baldrian, P" uniqKey="Baldrian P">P Baldrian</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wooley, Jc" uniqKey="Wooley J">JC Wooley</name>
</author>
<author><name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
<author><name sortKey="Friedberg, I" uniqKey="Friedberg I">I Friedberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tembe, W" uniqKey="Tembe W">W Tembe</name>
</author>
<author><name sortKey="Zavaljevski, N" uniqKey="Zavaljevski N">N Zavaljevski</name>
</author>
<author><name sortKey="Bode, E" uniqKey="Bode E">E Bode</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Satya, Rv" uniqKey="Satya R">RV Satya</name>
</author>
<author><name sortKey="Zavaljevski, N" uniqKey="Zavaljevski N">N Zavaljevski</name>
</author>
<author><name sortKey="Kumar, K" uniqKey="Kumar K">K Kumar</name>
</author>
<author><name sortKey="Reifman, J" uniqKey="Reifman J">J Reifman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Satya, Rv" uniqKey="Satya R">RV Satya</name>
</author>
<author><name sortKey="Kumar, K" uniqKey="Kumar K">K Kumar</name>
</author>
<author><name sortKey="Zavaljevski, N" uniqKey="Zavaljevski N">N Zavaljevski</name>
</author>
<author><name sortKey="Reifman, J" uniqKey="Reifman J">J Reifman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vijaya Satya, R" uniqKey="Vijaya Satya R">R Vijaya Satya</name>
</author>
<author><name sortKey="Zavaljevski, N" uniqKey="Zavaljevski N">N Zavaljevski</name>
</author>
<author><name sortKey="Kumar, K" uniqKey="Kumar K">K Kumar</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author><name sortKey="Ayanbule, K" uniqKey="Ayanbule K">K Ayanbule</name>
</author>
<author><name sortKey="Edwards, Nj" uniqKey="Edwards N">NJ Edwards</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author><name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author><name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bader, Kc" uniqKey="Bader K">KC Bader</name>
</author>
<author><name sortKey="Grothoff, C" uniqKey="Grothoff C">C Grothoff</name>
</author>
<author><name sortKey="Meier, H" uniqKey="Meier H">H Meier</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lee, Hp" uniqKey="Lee H">HP Lee</name>
</author>
<author><name sortKey="Sheu, T F" uniqKey="Sheu T">T-F Sheu</name>
</author>
<author><name sortKey="Tang, Cy" uniqKey="Tang C">CY Tang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lee, Hp" uniqKey="Lee H">HP Lee</name>
</author>
<author><name sortKey="Sheu, Tf" uniqKey="Sheu T">TF Sheu</name>
</author>
<author><name sortKey="Tsai, Yt" uniqKey="Tsai Y">YT Tsai</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zheng, J" uniqKey="Zheng J">J Zheng</name>
</author>
<author><name sortKey="Close, Tj" uniqKey="Close T">TJ Close</name>
</author>
<author><name sortKey="Jiang, T" uniqKey="Jiang T">T Jiang</name>
</author>
<author><name sortKey="Lonardi, S" uniqKey="Lonardi S">S Lonardi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lee, Hp" uniqKey="Lee H">HP Lee</name>
</author>
<author><name sortKey="Huang, Y H" uniqKey="Huang Y">Y-H Huang</name>
</author>
<author><name sortKey="Sheu, Tf" uniqKey="Sheu T">TF Sheu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lee, Hp" uniqKey="Lee H">HP Lee</name>
</author>
<author><name sortKey="Sheu, Tf" uniqKey="Sheu T">TF Sheu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marcais, G" uniqKey="Marcais G">G Marcais</name>
</author>
<author><name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kaderali, L" uniqKey="Kaderali L">L Kaderali</name>
</author>
<author><name sortKey="Schliep, A" uniqKey="Schliep A">A Schliep</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rouillard, Jm" uniqKey="Rouillard J">JM Rouillard</name>
</author>
<author><name sortKey="Zuker, M" uniqKey="Zuker M">M Zuker</name>
</author>
<author><name sortKey="Gulari, E" uniqKey="Gulari E">E Gulari</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wernersson, R" uniqKey="Wernersson R">R Wernersson</name>
</author>
<author><name sortKey="Nielsen, Hb" uniqKey="Nielsen H">HB Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nordberg, Ek" uniqKey="Nordberg E">EK Nordberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ashelford, Ke" uniqKey="Ashelford K">KE Ashelford</name>
</author>
<author><name sortKey="Weightman, Aj" uniqKey="Weightman A">AJ Weightman</name>
</author>
<author><name sortKey="Fry, Jc" uniqKey="Fry J">JC Fry</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ludwig, W" uniqKey="Ludwig W">W Ludwig</name>
</author>
<author><name sortKey="Strunk, O" uniqKey="Strunk O">O Strunk</name>
</author>
<author><name sortKey="Westram, R" uniqKey="Westram R">R Westram</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Adams, Md" uniqKey="Adams M">MD Adams</name>
</author>
<author><name sortKey="Kelley, Jm" uniqKey="Kelley J">JM Kelley</name>
</author>
<author><name sortKey="Gocayne, Jd" uniqKey="Gocayne J">JD Gocayne</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Baxevanis, Ad" uniqKey="Baxevanis A">AD Baxevanis</name>
</author>
<author><name sortKey="Ouellette, Bf" uniqKey="Ouellette B">BF Ouellette</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Choudhary, M" uniqKey="Choudhary M">M Choudhary</name>
</author>
<author><name sortKey="Mackenzie, C" uniqKey="Mackenzie C">C Mackenzie</name>
</author>
<author><name sortKey="Nereng, Ks" uniqKey="Nereng K">KS Nereng</name>
</author>
<author><name sortKey="Sodergren, E" uniqKey="Sodergren E">E Sodergren</name>
</author>
<author><name sortKey="Weinstock, Gm" uniqKey="Weinstock G">GM Weinstock</name>
</author>
<author><name sortKey="Kaplan, S" uniqKey="Kaplan S">S Kaplan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="White, T" uniqKey="White T">T White</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Shvachko, K" uniqKey="Shvachko K">K Shvachko</name>
</author>
<author><name sortKey="Kuang, H" uniqKey="Kuang H">H Kuang</name>
</author>
<author><name sortKey="Radia, S" uniqKey="Radia S">S Radia</name>
</author>
<author><name sortKey="Chansler, R" uniqKey="Chansler R">R Chansler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author><name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Battre, D" uniqKey="Battre D">D Battre</name>
</author>
<author><name sortKey="Ewen, S" uniqKey="Ewen S">S Ewen</name>
</author>
<author><name sortKey="Hueske, F" uniqKey="Hueske F">F Hueske</name>
</author>
<author><name sortKey="Kao, O" uniqKey="Kao O">O Kao</name>
</author>
<author><name sortKey="Markl, V" uniqKey="Markl V">V Markl</name>
</author>
<author><name sortKey="Warneke, D" uniqKey="Warneke D">D Warneke</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Capriolo, E" uniqKey="Capriolo E">E Capriolo</name>
</author>
<author><name sortKey="Wampler, D" uniqKey="Wampler D">D Wampler</name>
</author>
<author><name sortKey="Rutherglen, J" uniqKey="Rutherglen J">J Rutherglen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Thusoo, A" uniqKey="Thusoo A">A Thusoo</name>
</author>
<author><name sortKey="Sarma, Js" uniqKey="Sarma J">JS Sarma</name>
</author>
<author><name sortKey="Jain, N" uniqKey="Jain N">N Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Ulrich, Rl" uniqKey="Ulrich R">RL Ulrich</name>
</author>
<author><name sortKey="Ulrich, Mp" uniqKey="Ulrich M">MP Ulrich</name>
</author>
<author><name sortKey="Schell, Ma" uniqKey="Schell M">MA Schell</name>
</author>
<author><name sortKey="Kim, Hs" uniqKey="Kim H">HS Kim</name>
</author>
<author><name sortKey="Deshazer, D" uniqKey="Deshazer D">D DeShazer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Godoy, D" uniqKey="Godoy D">D Godoy</name>
</author>
<author><name sortKey="Randle, G" uniqKey="Randle G">G Randle</name>
</author>
<author><name sortKey="Simpson, Aj" uniqKey="Simpson A">AJ Simpson</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Evol Bioinform Online</journal-id>
<journal-id journal-id-type="iso-abbrev">Evol. Bioinform. Online</journal-id>
<journal-id journal-id-type="publisher-id">Evolutionary Bioinformatics</journal-id>
<journal-title-group><journal-title>Evolutionary Bioinformatics Online</journal-title>
</journal-title-group>
<issn pub-type="epub">1176-9343</issn>
<publisher><publisher-name>Libertas Academica</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">26884678</article-id>
<article-id pub-id-type="pmc">4750899</article-id>
<article-id pub-id-type="doi">10.4137/EBO.S35545</article-id>
<article-id pub-id-type="publisher-id">ebo-12-2016-073</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Technical Advance</subject>
</subj-group>
</article-categories>
<title-group><article-title>HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Karimi</surname>
<given-names>Ramin</given-names>
</name>
<xref ref-type="aff" rid="af1-ebo-12-2016-073">1</xref>
<xref ref-type="corresp" rid="c1-ebo-12-2016-073"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Hajdu</surname>
<given-names>Andras</given-names>
</name>
<xref ref-type="aff" rid="af1-ebo-12-2016-073">1</xref>
<xref ref-type="aff" rid="af2-ebo-12-2016-073">2</xref>
</contrib>
</contrib-group>
<aff id="af1-ebo-12-2016-073"><label>1</label>
Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</aff>
<aff id="af2-ebo-12-2016-073"><label>2</label>
Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</aff>
<author-notes><corresp id="c1-ebo-12-2016-073">CORRESPONDENCE: <email>raminkm2000@yahoo.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="collection"><year>2016</year>
</pub-date>
<pub-date pub-type="epub"><day>10</day>
<month>2</month>
<year>2016</year>
</pub-date>
<volume>12</volume>
<fpage>73</fpage>
<lpage>85</lpage>
<history><date date-type="received"><day>28</day>
<month>9</month>
<year>2015</year>
</date>
<date date-type="rev-recd"><day>05</day>
<month>11</month>
<year>2015</year>
</date>
<date date-type="accepted"><day>05</day>
<month>12</month>
<year>2015</year>
</date>
</history>
<permissions><copyright-statement>© 2016 the author(s), publisher and licensee Libertas Academica Ltd.</copyright-statement>
<copyright-year>2016</copyright-year>
<license license-type="open-access"><license-p>This is an open access article published under the Creative Commons CC-BY-NC 3.0 license.</license-p>
</license>
</permissions>
<abstract><p>Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with this effort, and driven by a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is much needed. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding <italic>k</italic>-mer generator, GkmerG (genome <italic>k</italic>-mers generator). Using this pipeline, we determine the frequency of <italic>k</italic>-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. Hadoop and MapReduce are used in this pipeline as parallel and distributed computing tools running on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genome databases. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.</p>
</abstract>
<kwd-group><kwd>DNA signature</kwd>
<kwd><italic>k</italic>
-mers</kwd>
<kwd>Hadoop</kwd>
<kwd>WordCount</kwd>
<kwd>MapReduce</kwd>
<kwd>Hive</kwd>
</kwd-group>
</article-meta>
</front>
<body><sec sec-type="intro"><title>Introduction</title>
<p>A DNA signature is a short <italic>k</italic>-mer oligonucleotide fragment of arbitrary length <italic>k</italic> that is unique or specific to a particular group of species selected from a target genome database. Signatures fall into two categories, unique and common, according to their intended use. The presence of a unique DNA signature in any volume of sequences and genetic material indicates the presence of the corresponding species.<xref rid="b1-ebo-12-2016-073" ref-type="bibr">1</xref>
,<xref rid="b2-ebo-12-2016-073" ref-type="bibr">2</xref>
 Therefore, signature discovery is the task of finding specific fragments of a genome in a database.<xref rid="b3-ebo-12-2016-073" ref-type="bibr">3</xref>
 Any pipeline, application, or algorithm designed for DNA signature discovery has to scan an entire database or multiple databases recursively. The procedure varies according to the purpose for which the DNA signatures are used.</p>
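<p>The distinction between unique and common signatures can be illustrated with a minimal sketch. It is not the HTSFinder implementation: the function names, the toy sequences, and the reading of "common" as "present in every target genome and absent from every nontarget genome" are assumptions made for illustration only, and reverse complements are ignored.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def find_signatures(targets, nontargets, k):
    """Classify k-mers of the target genomes as unique or common signatures.

    targets and nontargets map a (hypothetical) genome name to its sequence.
    A k-mer that occurs in any nontarget genome is never reported.
    """
    background = set()
    for sequence in nontargets.values():
        background |= kmers(sequence, k)

    # Record which target genomes contain each candidate k-mer.
    owners = defaultdict(set)
    for name, sequence in targets.items():
        for kmer in kmers(sequence, k) - background:
            owners[kmer].add(name)

    # Unique: found in exactly one target genome. Common: found in all of them.
    unique = {kmer: next(iter(names)) for kmer, names in owners.items() if len(names) == 1}
    common = {kmer for kmer, names in owners.items() if len(names) == len(targets)}
    return unique, common


# Toy data, invented for this example.
targets = {"genome_a": "ACGTACGGA", "genome_b": "TTACGGACC"}
nontargets = {"genome_c": "GGGACGTTT"}
unique, common = find_signatures(targets, nontargets, k=4)
```
]]></preformat>
<p>In this toy run, the 4-mers shared by both target genomes and absent from the nontarget genome are reported as common, while those found in only one target genome are reported as unique to it.</p>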
<p>Although 16S rDNA and 16S rRNA sequences have had a major impact on microbial taxonomy, they are useful mainly for taxa above the rank of species. Because of sequence similarities, they are not sufficient to define bacterial species and strains.<xref rid="b4-ebo-12-2016-073" ref-type="bibr">4</xref>
 Approximately 15% of bacterial genomes contain only a single copy of 16S rRNA.<xref rid="b5-ebo-12-2016-073" ref-type="bibr">5</xref>
 Since high-throughput sequences are often noisy and partial,<xref rid="b6-ebo-12-2016-073" ref-type="bibr">6</xref>
 the application of 16S sequences to next-generation sequencing (NGS) data analysis at the species level is even less efficient. Given the large number of DNA signatures in different species and the possibility of choosing arbitrary lengths for them, the signature-based approach is not only suitable for polymerase chain reaction (PCR) and microarray-based assays but also has great potential for NGS analysis. The high-throughput signature finder (HTSFinder) pipeline proposed in this paper has been designed to address some of the challenges of DNA signature discovery in order to enhance the usability of DNA signatures for NGS analysis.</p>
<p>Several tools and algorithms of DNA signature discovery have been proposed in the literature in order to facilitate the design of microbial and pathogen-based diagnostic assays; notable instances are discussed in the following sections.</p>
<p>Tool for Oligonucleotide Fingerprint Identification (TOFI)<xref rid="b7-ebo-12-2016-073" ref-type="bibr">7</xref>
 is designed to identify DNA fingerprints of a single genome as suitable probes for microarray-based diagnostic assays. It utilizes the whole genome of the pathogen instead of a specific gene (such as 16S rRNA) or specific regions of the genome for designing probes.<xref rid="b8-ebo-12-2016-073" ref-type="bibr">8</xref>
 In order to design DNA microarray probes, TOFI reduces the solution space by discarding DNA sequences that are common to the target sequence and one or more phylogenetically close sequences. Then, each extracted DNA microarray probe is compared with all DNA sequences from the chosen reference database.<xref rid="b7-ebo-12-2016-073" ref-type="bibr">7</xref>
</p>
<p>Tool for PCR Signature Identification (TOPSI)<xref rid="b9-ebo-12-2016-073" ref-type="bibr">9</xref>
 is a pipeline for real-time PCR signature discovery. TOPSI detects common signatures among multiple strains of bacterial genomes by collecting the shared regions through pairwise alignments between the input genomes. It is an extended version of TOFI.<xref rid="b9-ebo-12-2016-073" ref-type="bibr">9</xref>
,<xref rid="b10-ebo-12-2016-073" ref-type="bibr">10</xref>
</p>
<p>Insignia<xref rid="b11-ebo-12-2016-073" ref-type="bibr">11</xref>
 provides unique signatures that can be used to design primers for PCR and probes for microarray assays. It has two main components: the web interface and the computational pipeline. The computational pipeline uses grid computing and an algorithm to perform pairwise alignment of every pair of target genomes and background genomes for their comparison. Insignia provides signatures that are unique against the background genomes based on databases of bacterial and viral genomic sequences containing 13,928 organisms (11,274 viruses/phages and 2,653 bacteria).<xref rid="b12-ebo-12-2016-073" ref-type="bibr">12</xref>
 In fact, when a user adjusts the desired options in the Insignia web interface, a query runs on a database that contains precomputed results of DNA signature discovery.</p>
<p>TOFI, TOPSI, and Insignia use the open-source software MUMmer<xref rid="b13-ebo-12-2016-073" ref-type="bibr">13</xref>
 that implements a suffix-tree-based algorithm for comparing genomic sequences.<xref rid="b7-ebo-12-2016-073" ref-type="bibr">7</xref>
,<xref rid="b9-ebo-12-2016-073" ref-type="bibr">9</xref>
,<xref rid="b11-ebo-12-2016-073" ref-type="bibr">11</xref>
 MUMmer is a package for the alignment of very large DNA and amino acid sequences. Furthermore, these three pipelines use the Basic Local Alignment Search Tool (BLAST) to evaluate the specificity of signatures.</p>
<p>CaSSiS<xref rid="b14-ebo-12-2016-073" ref-type="bibr">14</xref>
 is an algorithm for detecting signatures with maximal group coverage within a user-defined specificity range for designing primers and probes. It provides signatures for single organisms or groups of organisms in hierarchically clustered sequence datasets. This algorithm calculates the Hamming distance between a signature candidate and its matched targets. CaSSiS uses the rRNA sequences provided by the SILVA database to create a signature collection for designing primers and probes.</p>
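<p>The Hamming distance mentioned above counts the positions at which two equal-length sequences differ. The following sketch shows the measure and a naive mismatch-tolerant search; it is not the CaSSiS code, and the function names and the sliding-window search are assumptions chosen for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def matches_within(candidate, sequence, max_mismatches):
    """Return start positions where candidate matches sequence within the mismatch limit."""
    k = len(candidate)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if hamming_distance(candidate, sequence[i:i + k]) <= max_mismatches
    ]
```
]]></preformat>
<p>For example, <monospace>matches_within("ACGT", "TTACGTACCT", 1)</monospace> reports position 2 (an exact match) and position 6 (one mismatch). A signature candidate is more specific the fewer such near-matches it has in nontarget sequences.</p>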
<p>The consecutive multiple discovery (CMD) algorithm<xref rid="b15-ebo-12-2016-073" ref-type="bibr">15</xref>
 is an iterative method that uses the parallel and incremental signature discovery (PISD) method as a kernel routine to discover implicit DNA signatures. PISD is a combination of the Hamming-distance-based algorithm, the internal-memory-based unique signature discovery (IMUS) approach,<xref rid="b16-ebo-12-2016-073" ref-type="bibr">16</xref>
 and Zheng’s method<xref rid="b17-ebo-12-2016-073" ref-type="bibr">17</xref>
 in terms of using the corresponding incremental and parallel computing techniques. PISD uses a mismatch tolerance and previously discovered signatures of specific lengths as candidates to find shorter signatures instead of scanning the whole database. CMD and PISD can find unique signatures for single sequences, but cannot search for signatures that are specific for groups;<xref rid="b16-ebo-12-2016-073" ref-type="bibr">16</xref>
 they are designed to find signatures of sequences from expressed sequence tag (EST) databases.</p>
<p>The internal memory-based unique signature discovery algorithm IMUS<xref rid="b16-ebo-12-2016-073" ref-type="bibr">16</xref>
 is an improvement of Zheng’s method,<xref rid="b17-ebo-12-2016-073" ref-type="bibr">17</xref>
 which is based on the Hamming distance for detecting unique signatures. IMUS tries to discard similar substrings of a sequence in order to obtain the DNA signatures as unique fragments. Parallel internal-memory-based unique signature discovery (PIMUS)<xref rid="b18-ebo-12-2016-073" ref-type="bibr">18</xref>
 is the improved version of IMUS. Both algorithms load the complete DNA database into the main memory to find unique signatures in EST datasets.</p>
<p>Distributed divide-and-conquer-based signature discovery (DDCSD)<xref rid="b19-ebo-12-2016-073" ref-type="bibr">19</xref>
 applies a divide-and-conquer strategy for detecting DNA signatures. When the dataset is too large to be loaded into memory all at once, the algorithm splits it into smaller segments that are loaded and processed one by one. The discovery node and the discovery routine are the main components of this algorithm: when the size of the dataset exceeds the available memory, the discovery routine splits the dataset into multiple parts that are processed one at a time by the discovery nodes. This algorithm is based on searching for similarities and mismatches in the patterns. Similar to CMD, PISD, IMUS, and PIMUS, this algorithm is designed to search EST datasets, but it can also process larger databases such as the human whole-genome EST database. <xref ref-type="table" rid="t1-ebo-12-2016-073">Table 1</xref>
 contains some more details on the algorithms described earlier.</p>
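<p>The idea of processing a dataset in parts that each fit into memory can be sketched as follows. This is a generic batching pattern, not the DDCSD algorithm: the function names and the batch size parameter are hypothetical, and the running totals are kept in memory here, whereas a real implementation for large databases would presumably write partial results to disk.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_kmers_in_batches(records, k, batch_size):
    """Count k-mers while holding only one batch of sequences in memory at a time."""
    totals = Counter()
    for batch in batches(records, batch_size):
        for sequence in batch:
            for i in range(len(sequence) - k + 1):
                totals[sequence[i:i + k]] += 1
    return totals
```
]]></preformat>
<p>Because <monospace>batches</monospace> accepts any iterable, the records can come from a generator that reads a file lazily, so the full dataset never has to be resident at once.</p>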
<p>Jellyfish<xref rid="b20-ebo-12-2016-073" ref-type="bibr">20</xref>
 is an algorithm to count the <italic>k</italic>
-mers in parallel. This algorithm implements a lock-free hash table optimization for counting <italic>k</italic>
-mers up to 31 bases in length.</p>
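<p>A length limit of this kind is consistent with packing each base into 2 bits so that a whole <italic>k</italic>-mer fits into one 64-bit machine word (31 bases need 62 bits). That this is the reason for the limit is an assumption on our part and is not stated in the text above. The sketch below shows only the packing, with hypothetical function names; it does not reproduce the lock-free hash table.</p>
<preformat preformat-type="code"><![CDATA[
```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODING = "ACGT"


def pack_kmer(kmer):
    """Pack a k-mer over A, C, G, T into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def unpack_kmer(value, k):
    """Recover the k-mer string from its 2-bit packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(DECODING[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```
]]></preformat>
<p>Integer keys of fixed width are cheap to hash and compare, which is one reason compact encodings are attractive for <italic>k</italic>-mer counting.</p>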
<p>There are other approaches to find signatures or probe sequences, such as PROBESEL,<xref rid="b21-ebo-12-2016-073" ref-type="bibr">21</xref>
 OligoArray,<xref rid="b22-ebo-12-2016-073" ref-type="bibr">22</xref>
 OligoWiz,<xref rid="b23-ebo-12-2016-073" ref-type="bibr">23</xref>
 YODA,<xref rid="b24-ebo-12-2016-073" ref-type="bibr">24</xref>
 PRIMROSE,<xref rid="b25-ebo-12-2016-073" ref-type="bibr">25</xref>
 and ARB-ProbeDesign.<xref rid="b26-ebo-12-2016-073" ref-type="bibr">26</xref>
 All of them are limited to one selected target or a single sequence in each run; thus, they are not applicable to large datasets.<xref rid="b14-ebo-12-2016-073" ref-type="bibr">14</xref>
</p>
<p>In practice, despite the considerable efforts behind the abovementioned and other methods, DNA signature discovery still faces a number of limitations.</p>
<p>Since most existing methods of DNA signature discovery require significant computational resources, they are not accessible to the entire research community. Because of the size of genome databases, large random-access memory (RAM) and central processing unit (CPU) requirements and long execution times are the major limitations of most of the abovementioned methods that are based on pattern comparison and pairwise alignment of the genomes. The choice of the mismatch tolerance level as a discovery condition also influences the results.</p>
<p>In some cases, it is necessary to load the whole dataset into the main memory for searching for unique or common signatures. When the size of the data exceeds the available memory, the execution will fail. For instance, in IMUS, PIMUS, and Zheng’s methods, the entire database has to be loaded into the memory.<xref rid="b19-ebo-12-2016-073" ref-type="bibr">19</xref>
 Thus, for sequential algorithms such as IMUS, increasing the number of CPU cores does not increase the discovery efficiency of the algorithm.<xref rid="b18-ebo-12-2016-073" ref-type="bibr">18</xref>
 Another limitation of most of the abovementioned methods is that they cannot efficiently find both unique and common signatures simultaneously. Most of them are capable of finding only the DNA signatures of a single genome. In addition, some of these methods do not allow the user to select an arbitrary length (<italic>k</italic>) for the signatures.</p>
<p>Another major limitation of DNA signature discovery methods is the lack of choice of target and nontarget genome databases. TOFI, TOPSI, and Insignia use BLAST databases (such as the nt and nr databases) as the background or nontarget genomes for evaluating the specificity of signatures, and the user has no option to choose other target and nontarget genome databases. As an example, in the Insignia web interface, the user receives a quick response without special requirements on local computational resources. However, this convenience comes with the restriction that other sequences cannot be used as the target and background genomes, because these are part of the Insignia database.<xref rid="b19-ebo-12-2016-073" ref-type="bibr">19</xref>
 With the advancement of sequencing technologies and the increasing number of complete genomes, whole-genome shotgun sequences, and draft genomes, some of these signatures will inevitably cease to be unique under BLAST specificity evaluation. This issue is a challenge not only for DNA signatures but also for all sequence-based identification methods.</p>
<p>The geographical distribution and diversity of the species, ecological and chemical status, host and environmental factors, isolation or complexity of the samples, and many other factors can have a great impact on the selection of target and nontarget genome databases for DNA signature discovery. When a considerable number of species are evidently absent from the sample, it is questionable to eliminate a large number of useful DNA signatures by evaluating their specificity against entire background sequence databases such as those of BLAST. For instance, when we are sure that the sample contains nothing from zebrafish, mouse, chimpanzee, black cottonwood, <italic>Macaca fascicularis</italic>
, etc., we do not need to check the uniqueness of our DNA signatures against their genomes; otherwise, we would lose a significant number of signatures.</p>
<p>ESTs are short fragments of mRNA sequences obtained by single sequencing of randomly selected cDNA clones. ESTs are mostly used either to identify gene transcripts or as an alternative cheap method of gene discovery and gene sequence determination.<xref rid="b27-ebo-12-2016-073" ref-type="bibr">27</xref>
</p>
<p>IMUS, PIMUS, CMD, PISD, and DDCSD are designed to scan EST sequences for the unique signatures. However, the ESTs represent only fragments of genes, not complete coding sequences;<xref rid="b28-ebo-12-2016-073" ref-type="bibr">28</xref>
 therefore, many signatures are missed.</p>
<p>The pipeline HTSFinder has significant advantages compared with the DNA signature discovery pipelines and algorithms described earlier.
<list list-type="bullet"><list-item><p>First, HTSFinder is capable of detecting all unique, common, and maximal group coverage signatures of the entire database or of multiple databases simultaneously.</p>
</list-item>
<list-item><p>Second, it becomes possible to select target and nontarget genome databases, based on user requirements. For instance, we have the ability to use both forward and reverse-complement genome sequences of a database for detecting DNA signatures.</p>
</list-item>
<list-item><p>Third, the pipeline can be run either on a cluster of low-cost computer nodes that are commonly available in research facilities or in a high-performance computing (HPC) environment.</p>
</list-item>
<list-item><p>Finally, the flexibility of the different phases of the pipeline makes it suitable for other bioinformatic and metagenomic studies such as NGS analysis.</p>
</list-item>
</list>
</p>
<p>HTSFinder detects both unique and group-specific signatures efficiently and accurately, without discarding a single signature from the database except those that contain International Union of Pure and Applied Chemistry (IUPAC) ambiguity codes such as K, M, N, R, S, W, and Y. Our GkmerG component removes any <italic>k</italic>
-mer containing IUPAC ambiguity codes after generating the <italic>k</italic>
-mers. In this pipeline, mismatch tolerance and the complexity of comparison and pairwise alignment search methods are not a concern.</p>
</sec>
<sec sec-type="materials|methods"><title>Materials and Methods</title>
<sec><title>Description of the pipeline</title>
<p>HTSFinder consists of three computational phases as shown in <xref ref-type="fig" rid="f1-ebo-12-2016-073">Figure 1</xref>
. The pipeline generates all possible <italic>k</italic>
-mers for every genome individually and then determines their frequency in the entire database. Finally, the DNA signatures of every species or strain are obtained for the database, or for the multiple databases, included in the pipeline. HTSFinder uses Hadoop, a parallel and distributed computing framework, for the second and third phases.</p>
</sec>
<sec sec-type="methods"><title>Data preparation</title>
<p>The first phase of the pipeline is carried out by GkmerG, which is designed to obtain all possible <italic>k</italic>
-mers of genome sequences in FAST-All (FASTA) format (*.fna or *.fa). This software tool removes the header lines of the genome and splits the sequence into subsequences of the specified length <italic>k</italic>
. Then, it eliminates the <italic>k</italic>
-mers that contain IUPAC ambiguity codes and every subsequence shorter than <italic>k</italic>
 that remains at the end of the sequence after splitting. <xref ref-type="fig" rid="f2-ebo-12-2016-073">Figure 2</xref>
 illustrates the split of the genome by GkmerG. Concatenating the files, sorting the <italic>k</italic>
-mers, and removing all duplicates except one are the last steps of GkmerG. For species with multiple chromosomes, including some bacterial genomes that are composed of multiple chromosomes<xref rid="b29-ebo-12-2016-073" ref-type="bibr">29</xref>
 and plasmids, GkmerG concatenates them into a single file before sorting at the end of the first phase. GkmerG also copies the original database into another directory as the reference database, prepending a number to every species name in it to simplify future data management. Once the output of the first phase has been obtained for a database, it can be reused indefinitely. If the database is updated, this phase needs to be repeated only for the updated or new genomes, not for the whole database. The output of GkmerG is the input for the second step of the pipeline, which is described in the following section.</p>
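<p>The data preparation steps described above can be sketched in a few lines of Python. This is only an illustration, not the GkmerG source code: the function names are hypothetical, a sliding window with a step of one base is assumed for producing all possible <italic>k</italic>-mers, and any character other than A, C, G, or T is treated as an IUPAC ambiguity code.</p>
<preformat>
```python
VALID_BASES = set("ACGT")  # assumption: every other letter is an ambiguity code


def iter_records(fasta_lines):
    """Yield one upper-case sequence per FASTA record, dropping header lines."""
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def unique_kmers(fasta_lines, k):
    """Return the sorted, duplicate-free k-mers of one genome."""
    kmers = set()
    for sequence in iter_records(fasta_lines):
        # The range ends before any trailing fragment shorter than k.
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer).issubset(VALID_BASES):
                kmers.add(kmer)
    return sorted(kmers)


if __name__ == "__main__":
    demo = [">chromosome", "ACGTN", "ACGT", ">plasmid", "ACGA"]
    print(unique_kmers(demo, 3))
```
</preformat>
<p>Records are processed one at a time so that no <italic>k</italic>-mer spans the boundary between a chromosome and a plasmid; whether GkmerG handles record boundaries in the same way is not stated in the text.</p>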
</sec>
<sec><title>The Apache Hadoop</title>
<p>The dramatic increase in the amount of data in various fields of science, particularly biology, has revealed the inadequacy of ordinary computers for big data analytics. This has prompted developers to build tools and applications based on parallel and distributed computing that can run on commodity hardware. The Apache Hadoop project<xref rid="b30-ebo-12-2016-073" ref-type="bibr">30</xref>
–<xref rid="b32-ebo-12-2016-073" ref-type="bibr">32</xref>
 is an open source, Java-based software framework for parallel and distributed computing on large datasets using commodity hardware. Hadoop allows simple programming models to be run on large structured and unstructured datasets across an arbitrary number of nodes in a cluster. A Hadoop cluster has a single master node and several slave nodes that are connected to each other through Secure Shell (SSH). It can run as a single-node cluster or as a multi-node cluster with up to thousands of nodes. The Hadoop core has two primary components: the Hadoop Distributed File System (HDFS) and MapReduce.</p>
<p>HDFS<xref rid="b31-ebo-12-2016-073" ref-type="bibr">31</xref>
,<xref rid="b33-ebo-12-2016-073" ref-type="bibr">33</xref>
 is the data storage part of Hadoop. It provides high-throughput access to large datasets across multiple nodes of a cluster. HDFS breaks the data down into smaller chunks, which are stored as independent elements. HDFS provides input data storage for the MapReduce framework.</p>
<p><italic>MapReduce</italic>
<xref rid="b31-ebo-12-2016-073" ref-type="bibr">31</xref>
 is a programming model for parallel and distributed data processing. MapReduce works by breaking the processing into two phases: Map and Reduce. The Map phase processes a set of data in parallel and returns an intermediate result, and the Reduce phase then reduces this result to a smaller set of data. Each Map and Reduce task works independently. In effect, MapReduce reduces a large amount of raw input data to a smaller amount of useful data for further processing.<xref rid="b34-ebo-12-2016-073" ref-type="bibr">34</xref>
,<xref rid="b35-ebo-12-2016-073" ref-type="bibr">35</xref>
</p>
<p>As another component of the second-generation Hadoop 2 release of Apache, YARN (Yet Another Resource Negotiator) was added in order to upgrade scheduling, resource management, and execution in Hadoop.<xref rid="b36-ebo-12-2016-073" ref-type="bibr">36</xref>
</p>
<p>Although Hadoop is defined as a distributed system with multiple nodes, its use of MapReduce for parallel processing of large datasets means that even a single node can process large datasets that exceed its memory and CPU capacity.</p>
<p>In this research, we used the Hadoop framework and WordCount program to calculate the frequency of <italic>k</italic>
-mers in very large genome datasets. In Hadoop 1.2.1 and earlier releases, the JAR (Java Archive) file of WordCount is also included. <xref ref-type="fig" rid="f3-ebo-12-2016-073">Figure 3</xref>
 illustrates a MapReduce and WordCount process.</p>
<p>In the second step of the pipeline, we copy all the output files of the first step to the HDFS and run the WordCount program.</p>
<p>The result of this step is a large file with two columns: one contains the sorted, duplicate-free list of <italic>k</italic>
-mers obtained from the files generated in the first step, and the other contains the frequencies of these <italic>k</italic>
-mers among the genomes of the database. A <italic>k</italic>
-mer with a frequency of 1 is a unique substring that appears in only one of the species in the database; such <italic>k</italic>
-mers are primarily what we are looking for as unique DNA signatures. The value associated with a <italic>k</italic>
-mer indicates the number of genomes (species) that contain the given <italic>k</italic>
-mer. <xref ref-type="table" rid="t2-ebo-12-2016-073">Table 2</xref>
 shows a portion of the Hadoop and WordCount output. For instance, the 18-mer with frequency 8 in the first row of <xref ref-type="table" rid="t2-ebo-12-2016-073">Table 2</xref>
 means that this 18-mer occurs in eight genomes among the 2,773 bacterial ones, while the 18-mer in the fourth row is a unique signature in the database. In the second step of the pipeline, we can extract all unique or group-specific signatures according to their frequency, but we cannot yet determine which genomes own these signatures.</p>
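<p>The counting performed in this step can be illustrated with a small in-memory analogue of WordCount. The sketch below is not Hadoop code: the function names and the toy 3-mers are hypothetical, and it assumes, as in the first phase, that each genome contributes every <italic>k</italic>-mer at most once, so that the count equals the number of genomes containing the <italic>k</italic>-mer.</p>
<preformat>
```python
from collections import Counter


def genome_frequencies(kmers_per_genome):
    """Count, for every k-mer, the number of genomes that contain it.

    kmers_per_genome maps a reference number to the k-mers of that genome.
    """
    counts = Counter()
    for kmers in kmers_per_genome.values():
        # "Map": emit each k-mer once per genome; "Reduce": sum the ones.
        counts.update(set(kmers))
    return counts


def with_frequency(counts, frequency):
    """Return the sorted k-mers that occur in exactly `frequency` genomes."""
    return sorted(kmer for kmer, n in counts.items() if n == frequency)


if __name__ == "__main__":
    genomes = {
        1: ["AAA", "AAC"],
        2: ["AAC", "AAG"],
        3: ["AAC", "AAG", "AAT"],
    }
    counts = genome_frequencies(genomes)
    print(with_frequency(counts, 1))  # unique signatures
    print(with_frequency(counts, 2))  # signatures shared by two genomes
```
</preformat>
<p>As in the pipeline, the counts alone do not reveal which genome owns a given signature; that link is recovered only in the third phase.</p>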
<p>Once the second phase has been executed for a database, its results can be reused until the next update of the database. However, unlike the first phase, which can be updated incrementally, the second phase has to be repeated for the entire database whenever the database is updated.</p>
<p>When there are multiple target and nontarget databases, it is possible to merge all of them in the pipeline, but as the input grows larger, it requires far more computational resources. We therefore suggest running the first and second steps for every database separately. Because WordCount collapses repeated <italic>k</italic>
-mers into a single entry in its output, this reduces both the size of the output files and the execution time. Then, we can merge the outputs of the second phase for all the databases and run WordCount in Hadoop on the merged data one more time, which shortens the overall process. Moreover, in future executions, the output file of the second phase can serve as the candidate list of its corresponding database. In that case, the first phase of the pipeline does not need to be performed again for previously processed databases, and the second phase can be repeated for target and nontarget databases by merging the smaller files. For instance, the first phase for the bacterial genome database produced 177.35 GB of 18-mers. In the second phase, this was reduced to a 103.03 GB file that contained all the candidate 18-mers in the database without any repeats. We can use this file as the candidate list of the bacterial genome database for further processing. <xref ref-type="fig" rid="f4-ebo-12-2016-073">Figure 4</xref>
 illustrates the process of finding DNA signatures of the target database among nontarget databases.</p>
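<p>One possible reading of this two-level counting is sketched below. It is an assumption about the procedure, not a transcription of Figure 4: the function name and toy data are hypothetical, each database is represented by its second-phase output (a mapping from <italic>k</italic>-mer to within-database frequency), and a target signature is taken to be a <italic>k</italic>-mer that is unique inside the target database and absent from every nontarget database.</p>
<preformat>
```python
from collections import Counter


def target_signatures(target_counts, nontarget_counts_list):
    """Return target k-mers that are unique in the target database and
    that occur in no nontarget database."""
    # Second WordCount pass: each database contributes each distinct
    # k-mer once, so the count is the number of databases containing it.
    databases_containing = Counter(target_counts.keys())
    for counts in nontarget_counts_list:
        databases_containing.update(counts.keys())
    return sorted(
        kmer
        for kmer, within_db in target_counts.items()
        if within_db == 1 and databases_containing[kmer] == 1
    )


if __name__ == "__main__":
    target = {"AAA": 1, "AAC": 2, "AAG": 1}
    nontarget = [{"AAG": 5, "AAT": 1}, {"CCC": 1}]
    print(target_signatures(target, nontarget))
```
</preformat>
<p>Only the distinct <italic>k</italic>-mers of each database enter the second pass, which is why the merged input is smaller than the concatenated first-phase output.</p>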
<p>The input for the third phase of the pipeline is the output of the first and second phases. The steps of the third phase are described in the following section.</p>
</sec>
<sec><title>The Apache Hive</title>
<p><italic>Hive</italic>
<xref rid="b37-ebo-12-2016-073" ref-type="bibr">37</xref>
–<xref rid="b39-ebo-12-2016-073" ref-type="bibr">39</xref>
 is a data warehouse infrastructure built on top of the Hadoop MapReduce framework. It is designed to query large datasets stored in the HDFS using an SQL-like language called HiveQL. Traditional relational databases require the data to be in a structured format, while Hive can handle both structured and unstructured information. It lets the user process large datasets with relatively little effort and in a reasonably short time. This research demonstrates that Hive can efficiently query billions of rows in one or more tables. With HiveQL, we can extract whatever we need from the results of the second step of the pipeline: all the unique signatures of a specific species in the database, or group-specific signatures that are common among 2, 3, 4, or more genomes. Owing to the flexibility of querying in Hive, there are various ways to create the tables and design the queries in the third step; optimizing these queries is a subject of our future work. After loading the data into the tables created with Hive, we can use queries such as SELECT and JOIN to extract relationships. Two tables should be created with Hive: one for the output of the first phase and another for all, or a selected part, of the output of the second phase. To exploit the ability of Hive to query very large tables and to avoid repeating queries, we added a column containing the reference number to the files from the first step. For example, file 1 contains the 18-mers from the first species in the database, so we inserted a column containing reference index 1 before all 18-mers in this file. Then, we merged all 2,773 files into a single large file (220.35 GB) with two columns of <italic>k</italic>
-mers and their related reference numbers. The reference number indicates the number that has been appended to the name of the species by GkmerG in the first phase.</p>
<p>There are several options to create the table from the output of the second step: one is to create the table without making any changes in the output and another one is to break down the output into smaller groups according to the targeted signature. For example, if we are looking for the unique signatures, it would be better to extract only 18-mers with frequency 1. However, if we are looking for a common signature, then it would be better to extract the 18-mers with a specific frequency number such as 2, 3, 4, etc. In order to have a faster and easier implementation with Hive and later steps, we recommend the second option.</p>
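<p>The two-table design and the join that recovers the owners of the signatures can be illustrated as follows. The sketch uses SQLite from the Python standard library as a stand-in for Hive, so the SQL is not HiveQL, and the table names, column names, and toy 3-mers are hypothetical.</p>
<preformat>
```python
import sqlite3


def owners_of_signatures(reference_rows, frequency_rows, frequency=1):
    """Join (reference number, k-mer) rows with (k-mer, frequency) rows and
    return (k-mer, reference number) pairs for the requested frequency."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE kmer_ref (ref INTEGER, kmer TEXT)")
    con.execute("CREATE TABLE kmer_freq (kmer TEXT, freq INTEGER)")
    con.executemany("INSERT INTO kmer_ref VALUES (?, ?)", reference_rows)
    con.executemany("INSERT INTO kmer_freq VALUES (?, ?)", frequency_rows)
    rows = con.execute(
        "SELECT f.kmer, r.ref FROM kmer_freq f "
        "JOIN kmer_ref r ON r.kmer = f.kmer "
        "WHERE f.freq = ? ORDER BY f.kmer, r.ref",
        (frequency,),
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    first_phase = [(1, "AAA"), (1, "AAC"), (2, "AAC"), (2, "AAG")]
    second_phase = [("AAA", 1), ("AAC", 2), ("AAG", 1)]
    print(owners_of_signatures(first_phase, second_phase, frequency=1))
    print(owners_of_signatures(first_phase, second_phase, frequency=2))
```
</preformat>
<p>Filtering on the frequency inside the query corresponds to the first option described above; the recommended second option would load only the rows with the frequency of interest into the second table before joining.</p>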
<p>Source code for GkmerG and information and command lines for Hadoop and Hive are freely available at: <ext-link ext-link-type="uri" xlink:href="http://www.inf.unideb.hu/~hajdua/HTSFinder.html">http://www.inf.unideb.hu/~hajdua/HTSFinder.html</ext-link>
, <ext-link ext-link-type="uri" xlink:href="https://sourceforge.net/projects/htsfinder/">https://sourceforge.net/projects/htsfinder/</ext-link>
, and <ext-link ext-link-type="uri" xlink:href="https://github.com/raminkm/HTSFinder">https://github.com/raminkm/HTSFinder</ext-link>
.</p>
</sec>
<sec sec-type="methods"><title>Selected sequence databases</title>
<sec sec-type="methods"><title>Bacterial genome database</title>
<p>To demonstrate the efficiency of our proposed method, the bacterial genome database with 2,773 complete genomes in FASTA format (*.fna) was downloaded from the National Center for Biotechnology Information (NCBI) database. The size of this database is 9.7 GB after decompression. The list of the bacterial genomes is available in Supplementary Files.</p>
</sec>
<sec sec-type="methods"><title>The reverse-complement bacterial genome database</title>
<p>Another database that we used in this study was the reverse-complement bacterial genome database. The revcom.pl 1.2 (available at: <ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/nash-bioinformatics-codelets/">http://code.google.com/p/nash-bioinformatics-codelets/</ext-link>
) is a Perl program written by John Nash (Copyright © Government of Canada, 2000–2012). We used this program to provide the reverse-complement sequences for the whole bacterial genome database.</p>
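<p>For readers unfamiliar with the operation, the reverse complement of a sequence can be computed as in the following minimal Python sketch. It is not the revcom.pl program used in this study; the function name is hypothetical, and characters other than A, C, G, and T are left unchanged as a simplifying assumption.</p>
<preformat>
```python
# Translation table for the four unambiguous bases, in both letter cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the order of the bases."""
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AACG"))
```
</preformat>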
</sec>
<sec sec-type="methods"><title>Human genome database</title>
<p>The whole human genome is another database used in this research. The Homo sapiens <italic>hs-ref-GRCh</italic>
38 sequences in FASTA format (*.fa.gz) were downloaded from the NCBI FTP site. The size of the genome was 2.9 GB after decompression.</p>
</sec>
</sec>
</sec>
<sec><title>Results and Discussion</title>
<sec sec-type="methods|results"><title>Results for the bacterial genome database</title>
<p>In the first step of the pipeline, GkmerG generated 2,773 files with a total size of 177.35 GB containing all possible 18-mers from the individual bacterial genomes. After copying the results of the first step into the HDFS, we ran the WordCount program using Hadoop to determine the frequencies of the 18-mers in the 2,773 files. The result of this execution was a 103.03-GB file with two columns. The first column contained the 18-mers or signatures, while the second contained the frequency of each 18-mer. Frequency 1 in this file indicates that the related signature is unique in the entire database. In other words, an 18-mer with frequency 1 is a unique signature among the 2,773 bacterial genomes, and an 18-mer with frequency 2 is a common signature that is present in two of the 2,773 bacterial genomes. <xref ref-type="table" rid="t3-ebo-12-2016-073">Table 3</xref>
 presents the quantities of the 10 least common and 10 most common 18-mers, with their frequencies, in the bacterial genome database. This table shows that 3,552,866,254 signatures are unique in the database and that one subsequence (18-mer) is repeated in 2,125 bacterial genomes.</p>
<p>In the third phase, we have created tables in Hive and loaded files from the first and second phases. The table with the reference numbers and <italic>k</italic>
-mers (220.35 GB) and the table with the list of unique signatures (67.5 GB) were used to run the query in Hive that identifies the species and strains owning the unique signatures. We repeated the query with the table containing the list of signatures with frequency 2 in place of the table of unique signatures to find every pair of species with a common signature. The same procedure applies to other frequencies.</p>
<p>As shown in <xref ref-type="table" rid="t4-ebo-12-2016-073">Table 4</xref>
, the output of the third phase was a file with two columns containing the following: the signature and the reference number indicating its corresponding bacterial genome in the reference database created by GkmerG in the first phase.</p>
<p>The following examples are parts of the results obtained by HTSFinder to show the efficiency of this pipeline.</p>
<p>No unique DNA signatures with <italic>k</italic>
 = 18 were found for 30 of the bacterial genomes in the database. They are listed in Supplementary Files.</p>
<p>In 475 genomes, the number of unique DNA signatures was below 10,000. <italic>Chlamydia</italic>
, a genus of bacteria with 83 species and strains in the bacterial genome database, has the lowest number of unique 18-mer DNA signatures. The number of unique 18-mer signatures was below 10,000 in 75 of them and below 1,000 in 57. We found 13 <italic>Chlamydia</italic>
 bacteria without unique signatures with <italic>k</italic>
 = 18. The 10 bacterial genomes with the highest number of unique DNA signatures in the bacterial genome database are shown in <xref ref-type="fig" rid="f5-ebo-12-2016-073">Figure 5</xref>
.</p>
<p><italic>Burkholderia mallei</italic>
 and <italic>Burkholderia pseudomallei</italic>
 are two closely related pathogens that are very difficult cases for PCR assays. These two bacteria are the causative agents of glanders and melioidosis diseases in humans and animals.<xref rid="b9-ebo-12-2016-073" ref-type="bibr">9</xref>
,<xref rid="b40-ebo-12-2016-073" ref-type="bibr">40</xref>
,<xref rid="b41-ebo-12-2016-073" ref-type="bibr">41</xref>
 Because of their phenotypic and genotypic similarity, they were considered to be the same species until a few years ago. In the literature, only one PCR signature has been reported to be unique to <italic>B. mallei</italic>
.<xref rid="b9-ebo-12-2016-073" ref-type="bibr">9</xref>
,<xref rid="b40-ebo-12-2016-073" ref-type="bibr">40</xref>
 The HTSFinder pipeline could detect a considerable number of DNA signatures for <italic>B. mallei</italic>
 and <italic>B. pseudomallei</italic>
. Although these signatures are unique only within the bacterial genome database, the notable number of signatures listed in <xref ref-type="table" rid="t5-ebo-12-2016-073">Table 5</xref>
 suggests that, depending on the circumstances, it is preferable to be able to define the uniqueness of DNA signatures and to select the target databases according to the requirements. Moreover, many more DNA signatures could be found by increasing the length of the <italic>k</italic>
-mers.</p>
<p>For frequencies greater than 1, this pipeline detects the common signatures not only within a single species and its strains but also across the entire database. Frequencies 2 and 3 were used as examples to demonstrate the efficiency of this pipeline in discovering common DNA signatures within the bacterial genome database.</p>
<p>As an example, the following results were obtained for <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
, which is the first bacterium in the database.</p>
<p>A total of 689,790,798 signatures of <italic>k</italic>
 = 18 with frequency 2 were found in the database, of which 673,490 are shared between <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
 and 2,382 other species.</p>
<p>No signature of <italic>k</italic>
 = 18 with frequency 2 was shared between <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
 and 390 other bacterial genomes. <xref ref-type="fig" rid="f6-ebo-12-2016-073">Figure 6</xref>
 presents the highest number of signatures with frequency 2 which are common between <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
 and 10 other bacterial genomes in the database.</p>
<p>The results of the executions for signatures with frequencies 2 and 3 showed that most of the signatures are shared among phylogenetically close species of the database. However, there were also a lot of signatures that belonged to unrelated bacterial genera and families. <xref ref-type="table" rid="t6-ebo-12-2016-073">Table 6</xref>
 shows a partial view of the results for frequencies 2 and 3. Of the 245,109,794 signatures of frequency 3 in the database, 160,264 are shared between <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
 and two other species.</p>
<p>A series of lengths <italic>k</italic>
 from 21 to 30 was considered to evaluate the effect of increasing the signature length on the results and to compare the results for odd and even values of <italic>k</italic>
. <xref ref-type="table" rid="t7-ebo-12-2016-073">Table 7</xref>
 contains the number of unique DNA signatures for the whole human genome and for chromosomes 1, <italic>X</italic>
, and <italic>Y</italic>
, which represent large, medium, and small sequences in the human genome. This table shows that increasing the length of the signature increases the number of unique signatures and that there is no meaningful difference between odd and even values of <italic>k</italic>.</p>
<p>To compare the number of signatures that are unique within a species with the number that are unique in the entire bacterial genome database, <italic>Bacillus</italic>
, with 81 strains in the database, was selected. The three phases of the pipeline were executed on these strains. <xref ref-type="table" rid="t8-ebo-12-2016-073">Table 8</xref>
 contains the five strains of <italic>Bacillus</italic>
 with the highest number of unique signatures and the five with the lowest number, both within the species and in the entire database. Although <italic>Bacillus_anthracis_A0248_uid59385</italic>
 and <italic>Bacillus_anthracis_Ames_Ancestor_uid58083</italic>
 have larger genomes, no unique signature of length 18 could be found for them because of their high similarity to other <italic>Bacillus</italic>
 strains.</p>
</sec>
<sec sec-type="intro|results"><title>Results on both forward and reverse-complement sequences of bacterial genome</title>
<p>In genome databases such as NCBI, only one strand of each DNA sequence is provided. However, to design primers, both the forward and reverse-complement sequences should be considered. Moreover, depending on the sequencing technology, the generated short reads can come from both strands. Therefore, the ability to obtain DNA signatures of both strands is potentially useful.</p>
<p>For the reverse-complement sequences, the size of the output files and the computational times of the first and second phases of the pipeline were the same as in the forward genome implementations. The output of the second phase for forward and reverse-complement genome databases resulted in a file of 103.03 GB for each. We have repeated WordCount on both of the databases one more time to determine the frequencies of <italic>k</italic>
-mers as illustrated in <xref ref-type="fig" rid="f4-ebo-12-2016-073">Figure 4</xref>
. The final result was a 52.53 GB file for the forward genome database and a file of the same size for the reverse-complement genome database. On the one hand, this means that the volume of data containing unique signatures for the forward database decreased from 67.5 to 52.53 GB, and likewise for the reverse-complement genome database. On the other hand, the overall volume of DNA signatures that we could find for the species in the bacterial genome database increased from 67.5 GB for a single strand to 105.06 GB for both strands of DNA.</p>
</sec>
<sec sec-type="intro|methods|results"><title>Implementation results for the forward and reverse-complement bacterial genome database and the human genome database</title>
<p>We have considered the forward bacterial genome as the target database. We have applied the method that is described in <xref ref-type="fig" rid="f4-ebo-12-2016-073">Figure 4</xref>
 and found 50.28 GB of <italic>k</italic>
-mers for the target genome database which are unique among the three databases.</p>
<p><xref ref-type="table" rid="t9-ebo-12-2016-073">Table 9</xref>
 presents the file size and numbers of unique DNA signatures of the target database against the nontarget ones.</p>
</sec>
<sec><title>Performance evaluation and computational times</title>
<p>For this experiment, we used two different platforms. The first was a single node with 12 processors (Intel Core i7-4930K CPU at 3.40 GHz), 55 GB of RAM, and 6 TB of hard disk. It ran Ubuntu 12.04.5 LTS, Java SE version “1.8.0_25”, Hadoop version 1.2.1, and Hive 0.12.0, and was installed as a single-node Hadoop cluster. The second platform was a multi-node cluster with seven nodes: one master node and six slave nodes. The master node had an Intel Core2 Quad Q6600 CPU at 2.40 GHz, 8 GB of RAM, and 3.2 TB of hard disk, while each slave had an Intel Core i3-2100 CPU at 3.10 GHz, 4 GB of RAM, and 500 GB of hard disk. All nodes ran the desktop version of Ubuntu 14.04.1 LTS 64-bit, Java version “1.7.0_65” (OpenJDK), Hadoop 1.2.1, and Hive 0.12.0.</p>
<p>The first phase of the pipeline, executed with GkmerG, took 156 minutes on five nodes and 780 minutes on a single node of the second platform to generate the 18-mers from the original bacterial genome database (9.7 GB). The output was 2,773 files containing 18-mers, with a total size of 177.35 GB.</p>
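<p>The two timings reported for the first phase imply a linear speedup, as the following short calculation shows. The function names are hypothetical, and the sketch assumes that both runs processed the same 9.7 GB input.</p>
<preformat>
```python
def speedup(single_node_minutes, multi_node_minutes):
    """Ratio of the single-node run time to the multi-node run time."""
    return single_node_minutes / multi_node_minutes


def parallel_efficiency(single_node_minutes, multi_node_minutes, nodes):
    """Speedup divided by the number of nodes (1.0 means linear scaling)."""
    return speedup(single_node_minutes, multi_node_minutes) / nodes


if __name__ == "__main__":
    print(speedup(780, 156))
    print(parallel_efficiency(780, 156, nodes=5))
```
</preformat>
<p>With 780 minutes on one node and 156 minutes on five nodes, the speedup is 5 and the parallel efficiency is 1, which is consistent with the first phase processing each genome independently.</p>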
<p><xref ref-type="table" rid="t10-ebo-12-2016-073">Table 10</xref>
 contains the corresponding computational results and the size of the files in the second and third phases of the pipeline for both platforms.</p>
<p>Although the whole computation on the second platform took about nine hours more than on the first one, comparing the RAM and CPU capacity of the two platforms confirms the ability of a cluster of low-cost computers that are commonly available in research facilities to accelerate big data analytics.</p>
<p><xref ref-type="table" rid="t11-ebo-12-2016-073">Table 11</xref>
 compares the size of the files for frequencies 1–3 and the time of loading and processing the queries on the first platform. The file containing 220.35 GB data was used as the second table for all the implementations.</p>
</sec>
</sec>
<sec sec-type="conclusions"><title>Conclusions</title>
<p>The data obtained in this study show the efficiency of our proposed pipeline in finding all possible DNA signatures of a target database. With this pipeline, we aim to overcome some limitations of DNA signature discovery by focusing on the efficient detection of all possible unique and common DNA signatures in a database, regardless of such challenges as pairwise alignment and mismatch tolerance. Another important feature of this pipeline is the ability to select target and nontarget databases. From the standpoint of this research, the nontarget genome database is not necessarily the entire background genome database, such as that of BLAST, used for the assessment and specificity evaluation of DNA signatures; it can be chosen according to the requirements. General applicability is another issue considered in this pipeline: it can be launched either on a cluster of low-cost nodes or in an HPC environment. Although the volumes of the datasets in this study are very large (e.g., 287.85 GB in a single run), DNA signatures are detected precisely and comprehensively in the target databases, and the execution times are reasonably short. The proposed experiment presents only the basic idea, and there is great flexibility in designing implementations for the phases of this approach. Once the pipeline is implemented, users can manipulate their datasets according to their requirements. This pipeline can be an efficient method not only for DNA signature discovery but also for other purposes in bioinformatic and metagenomic studies, such as the alignment and assembly of short reads and next-generation sequencing analysis.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Materials</title>
<p><bold>Download source for GkmerG and supplementary data:</bold>
<list list-type="order"><list-item><p><ext-link ext-link-type="uri" xlink:href="http://www.inf.unideb.hu/~hajdua/HTSFinder.html">http://www.inf.unideb.hu/~hajdua/HTSFinder.html</ext-link>
</p>
</list-item>
<list-item><p><ext-link ext-link-type="uri" xlink:href="https://sourceforge.net/projects/htsfinder/">https://sourceforge.net/projects/htsfinder/</ext-link>
</p>
</list-item>
<list-item><p><ext-link ext-link-type="uri" xlink:href="https://github.com/raminkm/HTSFinder">https://github.com/raminkm/HTSFinder</ext-link>
</p>
</list-item>
</list>
</p>
<p>Content (after decompression):
<list list-type="bullet"><list-item><p>Hadoop and Hive installation guide and command lines.</p>
</list-item>
<list-item><p>Excel file of bacterial genome database reference generated by GkmerG.</p>
</list-item>
<list-item><p>List of bacterial genomes without any unique 18-mers (DNA signature) in the database.</p>
</list-item>
<list-item><p>The GkmerG algorithm figure.</p>
</list-item>
<list-item><p>GkmerG.tar.gz, which includes the software components and an example database for testing.</p>
</list-item>
</list>
</p>
</sec>
</body>
<back><ack><title>Acknowledgments</title>
<p>The authors gratefully thank Samira Sajjadian, Edéné Rutkovszky, Herendi Tamás, and Laszló Kovács for their help, support, and encouragement. Special thanks also go to the Faculty of Informatics and the Department of Computer Graphics and Image Processing at the University of Debrecen for providing computational resources for this research.</p>
</ack>
<fn-group><fn id="fn1-ebo-12-2016-073"><p><bold>ACADEMIC EDITOR:</bold>
 Jike Cui, Associate Editor</p>
</fn>
<fn id="fn2-ebo-12-2016-073"><p><bold>PEER REVIEW:</bold>
 Six peer reviewers contributed to the peer review report. Reviewers’ reports totaled 2604 words, excluding any confidential comments to the academic editor.</p>
</fn>
<fn id="fn3-ebo-12-2016-073"><p><bold>FUNDING:</bold>
 This study was supported in part by Project TAMOP-4.2.2.C-11/1/KONV-2012-0001, which is supported by the European Union and cofinanced by the European Social Fund. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.</p>
</fn>
<fn id="fn4-ebo-12-2016-073"><p><bold>COMPETING INTERESTS:</bold>
 Authors disclose no potential conflicts of interest.</p>
</fn>
<fn id="fn5-ebo-12-2016-073"><p>Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).</p>
</fn>
<fn id="fn6-ebo-12-2016-073"><p><bold>Author Contributions</bold>
</p>
<p>Conceived and designed the experiments: RK, AH. Analyzed the data: RK. Wrote the first draft of the manuscript: RK. Contributed to the writing of the manuscript: RK, AH. Agreed with the manuscript results and conclusions: RK, AH. Jointly developed the structure and arguments for the paper: RK, AH. Made critical revisions and approved final version: RK, AH. Both authors reviewed and approved the final manuscript.</p>
</fn>
</fn-group>
<ref-list><title>REFERENCES</title>
<ref id="b1-ebo-12-2016-073"><label>1</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kaderali</surname>
<given-names>L</given-names>
</name>
<name><surname>Schliep</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Selecting signature oligonucleotides to identify organisms using DNA arrays</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>10</issue>
<fpage>1340</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">12376378</pub-id>
</element-citation>
</ref>
<ref id="b2-ebo-12-2016-073"><label>2</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Francois</surname>
<given-names>P</given-names>
</name>
<name><surname>Charbonnier</surname>
<given-names>Y</given-names>
</name>
<name><surname>Jacquet</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rapid bacterial identification using evanescent-waveguide oligonucleotide microarray classification</article-title>
<source>J Microbiol Methods</source>
<year>2006</year>
<volume>65</volume>
<issue>3</issue>
<fpage>390</fpage>
<lpage>403</lpage>
<pub-id pub-id-type="pmid">16216356</pub-id>
</element-citation>
</ref>
<ref id="b3-ebo-12-2016-073"><label>3</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>F</given-names>
</name>
<name><surname>Stormo</surname>
<given-names>GD</given-names>
</name>
</person-group>
<article-title>Selection of optimal DNA oligos for gene expression arrays</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<issue>11</issue>
<fpage>1067</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="pmid">11724738</pub-id>
</element-citation>
</ref>
<ref id="b4-ebo-12-2016-073"><label>4</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Coenye</surname>
<given-names>T</given-names>
</name>
<name><surname>Vandamme</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Use of the genomic signature in bacterial classification and identification</article-title>
<source>Syst Appl Microbiol</source>
<year>2004</year>
<volume>27</volume>
<issue>2</issue>
<fpage>175</fpage>
<lpage>85</lpage>
<pub-id pub-id-type="pmid">15046306</pub-id>
</element-citation>
</ref>
<ref id="b5-ebo-12-2016-073"><label>5</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Větrovský</surname>
<given-names>T</given-names>
</name>
<name><surname>Baldrian</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>The variability of the 16S rRNA gene in bacterial genomes and its consequences for bacterial community analyses</article-title>
<source>PLoS One</source>
<year>2013</year>
<volume>8</volume>
<issue>2</issue>
<fpage>e57923</fpage>
<pub-id pub-id-type="pmid">23460914</pub-id>
</element-citation>
</ref>
<ref id="b6-ebo-12-2016-073"><label>6</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wooley</surname>
<given-names>JC</given-names>
</name>
<name><surname>Godzik</surname>
<given-names>A</given-names>
</name>
<name><surname>Friedberg</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>A primer on metagenomics</article-title>
<source>PLoS Comput Biol</source>
<year>2010</year>
<volume>6</volume>
<issue>2</issue>
<fpage>e1000667</fpage>
<pub-id pub-id-type="pmid">20195499</pub-id>
</element-citation>
</ref>
<ref id="b7-ebo-12-2016-073"><label>7</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tembe</surname>
<given-names>W</given-names>
</name>
<name><surname>Zavaljevski</surname>
<given-names>N</given-names>
</name>
<name><surname>Bode</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>1</issue>
<fpage>5</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="pmid">17068088</pub-id>
</element-citation>
</ref>
<ref id="b8-ebo-12-2016-073"><label>8</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Satya</surname>
<given-names>RV</given-names>
</name>
<name><surname>Zavaljevski</surname>
<given-names>N</given-names>
</name>
<name><surname>Kumar</surname>
<given-names>K</given-names>
</name>
<name><surname>Reifman</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>A high-throughput pipeline for designing microarray-based pathogen diagnostic assays</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>185</fpage>
<pub-id pub-id-type="pmid">18402679</pub-id>
</element-citation>
</ref>
<ref id="b9-ebo-12-2016-073"><label>9</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Satya</surname>
<given-names>RV</given-names>
</name>
<name><surname>Kumar</surname>
<given-names>K</given-names>
</name>
<name><surname>Zavaljevski</surname>
<given-names>N</given-names>
</name>
<name><surname>Reifman</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>A high-throughput pipeline for the design of real-time PCR signatures</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>1</issue>
<fpage>340</fpage>
<pub-id pub-id-type="pmid">20573238</pub-id>
</element-citation>
</ref>
<ref id="b10-ebo-12-2016-073"><label>10</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Vijaya Satya</surname>
<given-names>R</given-names>
</name>
<name><surname>Zavaljevski</surname>
<given-names>N</given-names>
</name>
<name><surname>Kumar</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>In silico microarray probe design for diagnosis of multiple pathogens</article-title>
<source>BMC Genomics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>496</fpage>
<pub-id pub-id-type="pmid">18940003</pub-id>
</element-citation>
</ref>
<ref id="b11-ebo-12-2016-073"><label>11</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name><surname>Ayanbule</surname>
<given-names>K</given-names>
</name>
<name><surname>Edwards</surname>
<given-names>NJ</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Insignia: a DNA signature search web server for diagnostic assay development</article-title>
<source>Nucleic Acids Res</source>
<year>2009</year>
<volume>37</volume>
<issue>Web Server issue</issue>
<fpage>W229</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="pmid">19417071</pub-id>
</element-citation>
</ref>
<ref id="b12-ebo-12-2016-073"><label>12</label>
<element-citation publication-type="webpage"><person-group person-group-type="author"><collab>Insignia Database and Web Interface</collab>
</person-group>
<comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://insignia.cbcb.umd.edu/">http://insignia.cbcb.umd.edu/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="b13-ebo-12-2016-073"><label>13</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name><surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name><surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biol</source>
<year>2004</year>
<volume>5</volume>
<issue>2</issue>
<fpage>R12</fpage>
<pub-id pub-id-type="pmid">14759262</pub-id>
</element-citation>
</ref>
<ref id="b14-ebo-12-2016-073"><label>14</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bader</surname>
<given-names>KC</given-names>
</name>
<name><surname>Grothoff</surname>
<given-names>C</given-names>
</name>
<name><surname>Meier</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>11</issue>
<fpage>1546</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="pmid">21471017</pub-id>
</element-citation>
</ref>
<ref id="b15-ebo-12-2016-073"><label>15</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname>
<given-names>HP</given-names>
</name>
<name><surname>Sheu</surname>
<given-names>T-F</given-names>
</name>
<name><surname>Tang</surname>
<given-names>CY</given-names>
</name>
</person-group>
<article-title>A parallel and incremental algorithm for efficient unique signature discovery on DNA databases</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>1</issue>
<fpage>132</fpage>
<pub-id pub-id-type="pmid">20230647</pub-id>
</element-citation>
</ref>
<ref id="b16-ebo-12-2016-073"><label>16</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lee</surname>
<given-names>HP</given-names>
</name>
<name><surname>Sheu</surname>
<given-names>TF</given-names>
</name>
<name><surname>Tsai</surname>
<given-names>YT</given-names>
</name>
</person-group>
<article-title>Efficient discovery of unique signatures on whole-genome EST databases</article-title>
<source>Proceedings of the 2005 ACM Symposium on Applied Computing</source>
<publisher-loc>New Mexico, USA</publisher-loc>
<year>2005</year>
<fpage>100</fpage>
<lpage>4</lpage>
</element-citation>
</ref>
<ref id="b17-ebo-12-2016-073"><label>17</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname>
<given-names>J</given-names>
</name>
<name><surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name><surname>Jiang</surname>
<given-names>T</given-names>
</name>
<name><surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Efficient selection of unique and popular oligos for large EST databases</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>13</issue>
<fpage>2101</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="pmid">15059835</pub-id>
</element-citation>
</ref>
<ref id="b18-ebo-12-2016-073"><label>18</label>
<element-citation publication-type="confproc"><person-group person-group-type="author"><name><surname>Lee</surname>
<given-names>HP</given-names>
</name>
<name><surname>Huang</surname>
<given-names>Y-H</given-names>
</name>
<name><surname>Sheu</surname>
<given-names>TF</given-names>
</name>
</person-group>
<source>Rapid DNA signature discovery using a novel parallel algorithm</source>
<conf-name>ICCGI 2012, The Seventh International Multi-Conference on Computing in the Global Information Technology</conf-name>
<conf-loc>Venice, Italy</conf-loc>
<year>2012</year>
<fpage>83</fpage>
<lpage>8</lpage>
</element-citation>
</ref>
<ref id="b19-ebo-12-2016-073"><label>19</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname>
<given-names>HP</given-names>
</name>
<name><surname>Sheu</surname>
<given-names>TF</given-names>
</name>
</person-group>
<article-title>An algorithm of discovering signatures from DNA databases on a computer cluster</article-title>
<source>BMC Bioinformatics</source>
<year>2014</year>
<volume>15</volume>
<issue>1</issue>
<fpage>339</fpage>
<pub-id pub-id-type="pmid">25282047</pub-id>
</element-citation>
</ref>
<ref id="b20-ebo-12-2016-073"><label>20</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Marcais</surname>
<given-names>G</given-names>
</name>
<name><surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="pmid">21217122</pub-id>
</element-citation>
</ref>
<ref id="b21-ebo-12-2016-073"><label>21</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kaderali</surname>
<given-names>L</given-names>
</name>
<name><surname>Schliep</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>An algorithm to select target specific probes for DNA chips</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>10</issue>
<fpage>1340</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">12376378</pub-id>
</element-citation>
</ref>
<ref id="b22-ebo-12-2016-073"><label>22</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rouillard</surname>
<given-names>JM</given-names>
</name>
<name><surname>Zuker</surname>
<given-names>M</given-names>
</name>
<name><surname>Gulari</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach</article-title>
<source>Nucleic Acids Res</source>
<year>2003</year>
<volume>31</volume>
<issue>12</issue>
<fpage>3057</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="pmid">12799432</pub-id>
</element-citation>
</ref>
<ref id="b23-ebo-12-2016-073"><label>23</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wernersson</surname>
<given-names>R</given-names>
</name>
<name><surname>Nielsen</surname>
<given-names>HB</given-names>
</name>
</person-group>
<article-title>OligoWiz 2.0 – integrating sequence feature annotation into the design of microarray probes</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<issue>suppl 2</issue>
<fpage>W611</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="pmid">15980547</pub-id>
</element-citation>
</ref>
<ref id="b24-ebo-12-2016-073"><label>24</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Nordberg</surname>
<given-names>EK</given-names>
</name>
</person-group>
<article-title>YODA: selecting signature oligonucleotides</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>8</issue>
<fpage>1365</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="pmid">15572465</pub-id>
</element-citation>
</ref>
<ref id="b25-ebo-12-2016-073"><label>25</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ashelford</surname>
<given-names>KE</given-names>
</name>
<name><surname>Weightman</surname>
<given-names>AJ</given-names>
</name>
<name><surname>Fry</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>PRIMROSE: a computer program for generating and estimating the phylogenetic range of 16S rRNA oligonucleotide probes and primers in conjunction with the RDP-II database</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>15</issue>
<fpage>3481</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="pmid">12140334</pub-id>
</element-citation>
</ref>
<ref id="b26-ebo-12-2016-073"><label>26</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ludwig</surname>
<given-names>W</given-names>
</name>
<name><surname>Strunk</surname>
<given-names>O</given-names>
</name>
<name><surname>Westram</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ARB: a software environment for sequence data</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>4</issue>
<fpage>1363</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="pmid">14985472</pub-id>
</element-citation>
</ref>
<ref id="b27-ebo-12-2016-073"><label>27</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Adams</surname>
<given-names>MD</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>JM</given-names>
</name>
<name><surname>Gocayne</surname>
<given-names>JD</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Complementary DNA sequencing: expressed sequence tags and human genome project</article-title>
<source>Science</source>
<year>1991</year>
<volume>252</volume>
<issue>5013</issue>
<fpage>1651</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="pmid">2047873</pub-id>
</element-citation>
</ref>
<ref id="b28-ebo-12-2016-073"><label>28</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Baxevanis</surname>
<given-names>AD</given-names>
</name>
<name><surname>Ouellette</surname>
<given-names>BF</given-names>
</name>
</person-group>
<source>Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins</source>
<volume>43</volume>
<publisher-name>John Wiley &amp; Sons</publisher-name>
<year>2004</year>
</element-citation>
</ref>
<ref id="b29-ebo-12-2016-073"><label>29</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Choudhary</surname>
<given-names>M</given-names>
</name>
<name><surname>Mackenzie</surname>
<given-names>C</given-names>
</name>
<name><surname>Nereng</surname>
<given-names>KS</given-names>
</name>
<name><surname>Sodergren</surname>
<given-names>E</given-names>
</name>
<name><surname>Weinstock</surname>
<given-names>GM</given-names>
</name>
<name><surname>Kaplan</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Multiple chromosomes in bacteria: structure and function of chromosome II of <italic>Rhodobacter sphaeroides</italic> 2.4.1T</article-title>
<source>J Bacteriol</source>
<year>1994</year>
<volume>176</volume>
<issue>24</issue>
<fpage>7694</fpage>
<lpage>702</lpage>
<pub-id pub-id-type="pmid">8002595</pub-id>
</element-citation>
</ref>
<ref id="b30-ebo-12-2016-073"><label>30</label>
<element-citation publication-type="webpage"><person-group person-group-type="author"><collab>Apache Hadoop</collab>
</person-group>
<comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://hadoop.apache.org/">http://hadoop.apache.org/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="b31-ebo-12-2016-073"><label>31</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>White</surname>
<given-names>T</given-names>
</name>
</person-group>
<source>Hadoop: The Definitive Guide</source>
<publisher-name>O’Reilly Media, Inc.</publisher-name>
<year>2012</year>
</element-citation>
</ref>
<ref id="b32-ebo-12-2016-073"><label>32</label>
<element-citation publication-type="webpage"><person-group person-group-type="author"><collab>Cloudera</collab>
</person-group>
<source>Hadoop and Big Data</source>
<comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html">http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html</ext-link>
</comment>
</element-citation>
</ref>
<ref id="b33-ebo-12-2016-073"><label>33</label>
<element-citation publication-type="confproc"><person-group person-group-type="author"><name><surname>Shvachko</surname>
<given-names>K</given-names>
</name>
<name><surname>Kuang</surname>
<given-names>H</given-names>
</name>
<name><surname>Radia</surname>
<given-names>S</given-names>
</name>
<name><surname>Chansler</surname>
<given-names>R</given-names>
</name>
</person-group>
<source>The Hadoop Distributed File System</source>
<conf-name>Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium On</conf-name>
<publisher-loc>Incline Village, Nevada, USA</publisher-loc>
<year>2010</year>
<fpage>1</fpage>
<lpage>10</lpage>
</element-citation>
</ref>
<ref id="b34-ebo-12-2016-073"><label>34</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dean</surname>
<given-names>J</given-names>
</name>
<name><surname>Ghemawat</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>MapReduce: simplified data processing on large clusters</article-title>
<source>Commun ACM</source>
<year>2008</year>
<volume>51</volume>
<issue>1</issue>
<fpage>107</fpage>
<lpage>13</lpage>
</element-citation>
</ref>
<ref id="b35-ebo-12-2016-073"><label>35</label>
<element-citation publication-type="confproc"><person-group person-group-type="author"><name><surname>Battre</surname>
<given-names>D</given-names>
</name>
<name><surname>Ewen</surname>
<given-names>S</given-names>
</name>
<name><surname>Hueske</surname>
<given-names>F</given-names>
</name>
<name><surname>Kao</surname>
<given-names>O</given-names>
</name>
<name><surname>Markl</surname>
<given-names>V</given-names>
</name>
<name><surname>Warneke</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>Nephele/PACTs: a programming model and execution framework for web-scale analytical processing</source>
<conf-name>Proceedings of the 1st ACM Symposium on Cloud Computing</conf-name>
<conf-loc>Indianapolis, IN, USA</conf-loc>
<year>2010</year>
<fpage>119</fpage>
<lpage>30</lpage>
</element-citation>
</ref>
<ref id="b36-ebo-12-2016-073"><label>36</label>
<element-citation publication-type="webpage"><person-group person-group-type="author"><collab>Apache Hadoop NextGen MapReduce (YARN)</collab>
</person-group>
<comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html</ext-link>
</comment>
</element-citation>
</ref>
<ref id="b37-ebo-12-2016-073"><label>37</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Capriolo</surname>
<given-names>E</given-names>
</name>
<name><surname>Wampler</surname>
<given-names>D</given-names>
</name>
<name><surname>Rutherglen</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>Programming Hive</source>
<publisher-name>O’Reilly Media, Inc.</publisher-name>
<year>2012</year>
</element-citation>
</ref>
<ref id="b38-ebo-12-2016-073"><label>38</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Thusoo</surname>
<given-names>A</given-names>
</name>
<name><surname>Sarma</surname>
<given-names>JS</given-names>
</name>
<name><surname>Jain</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Hive: a warehousing solution over a map-reduce framework</article-title>
<source>Proceedings of the VLDB Endowment</source>
<year>2009</year>
<volume>2</volume>
<issue>2</issue>
<fpage>1626</fpage>
<lpage>9</lpage>
</element-citation>
</ref>
<ref id="b39-ebo-12-2016-073"><label>39</label>
<element-citation publication-type="webpage"><person-group person-group-type="author"><collab>Apache Hive</collab>
</person-group>
<comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://hive.apache.org/">https://hive.apache.org/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="b40-ebo-12-2016-073"><label>40</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ulrich</surname>
<given-names>RL</given-names>
</name>
<name><surname>Ulrich</surname>
<given-names>MP</given-names>
</name>
<name><surname>Schell</surname>
<given-names>MA</given-names>
</name>
<name><surname>Kim</surname>
<given-names>HS</given-names>
</name>
<name><surname>DeShazer</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Development of a polymerase chain reaction assay for the specific identification of <italic>Burkholderia mallei</italic>
 and differentiation from <italic>Burkholderia pseudomallei</italic>
 and other closely related Burkholderiaceae</article-title>
<source>Diagn Microbiol Infect Dis</source>
<year>2006</year>
<volume>55</volume>
<issue>1</issue>
<fpage>37</fpage>
<lpage>45</lpage>
<pub-id pub-id-type="pmid">16546342</pub-id>
</element-citation>
</ref>
<ref id="b41-ebo-12-2016-073"><label>41</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Godoy</surname>
<given-names>D</given-names>
</name>
<name><surname>Randle</surname>
<given-names>G</given-names>
</name>
<name><surname>Simpson</surname>
<given-names>AJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, <italic>Burkholderia pseudomallei</italic>
 and <italic>Burkholderia mallei</italic>
</article-title>
<source>J Clin Microbiol</source>
<year>2003</year>
<volume>41</volume>
<issue>5</issue>
<fpage>2068</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="pmid">12734250</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group><fig id="f1-ebo-12-2016-073" position="float"><label>Figure 1</label>
<caption><p>The three main phases of HTSFinder for detecting DNA signatures. The second phase can be repeated on the obtained results if required.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig1"></graphic>
</fig>
<fig id="f2-ebo-12-2016-073" position="float"><label>Figure 2</label>
<caption><p>Splitting of the genome by GkmerG for <italic>k</italic>
 = 18 to obtain all possible 18-mers. Generating <italic>k</italic>
-mers for a single genome with GkmerG includes purgation, splitting, concatenation, cleaning, sorting, and removing duplicates so that only one copy of each remains. The output of GkmerG is a file containing the <italic>k</italic>
-mers of a genome in a single column. The labels above the file numbers in this figure represent the beginning of four <italic>k</italic>
-mers at the head of the files.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig2"></graphic>
</fig>
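<boxed-text position="float"><p>The <italic>k</italic>-mer generation steps listed in the caption of Figure 2 can be illustrated with a minimal Python sketch. It is not the GkmerG implementation: according to the caption, GkmerG works with split files, whereas this sketch slides a window over a string held in memory. The function name kmers_of_genome, the treatment of FASTA header lines as the purgation step, and the rule of discarding windows that contain characters other than A, C, G, and T are assumptions made for illustration only.</p>
<preformat>
```python
def kmers_of_genome(fasta_text, k=18):
    """Return the sorted, duplicate-free k-mers of one genome (illustrative only)."""
    # Purgation (assumed meaning): drop FASTA header lines, join the sequence lines.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    ).upper()

    # Splitting: every window of width k, shifted by one base at a time.
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))

    # Cleaning (assumed rule): keep only windows made of A, C, G, T.
    clean = (w for w in windows if set(w).issubset("ACGT"))

    # Sorting and removing duplicates so that one copy of each k-mer remains.
    return sorted(set(clean))


if __name__ == "__main__":
    # One k-mer per line, matching the single-column output described in the caption.
    for kmer in kmers_of_genome(">toy genome\nACGTAC\nGTNA\n", k=4):
        print(kmer)
```
</preformat>
<p>Holding the whole genome in memory keeps the sketch short; for inputs of the size reported in this study, a file-based approach such as the one the caption describes would be the more realistic choice.</p>
</boxed-text>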
<fig id="f3-ebo-12-2016-073" position="float"><label>Figure 3</label>
<caption><p>An example of the overall MapReduce WordCount process. The original image was made by Trifork.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig3"></graphic>
</fig>
<fig id="f4-ebo-12-2016-073" position="float"><label>Figure 4</label>
<caption><p>The recommended process for detecting unique DNA signatures of a target database against nontarget databases. In step 2 of this figure, the frequency of <italic>k</italic>
-mers ranges from 1 to <italic>n</italic>
, where <italic>n</italic>
 is the total number of databases used in the pipeline. Since there are four databases in this figure, the frequency of <italic>k</italic>
-mers in step 2 ranges from 1 to 4. In step 3, there are two input files with the list of non-repeated <italic>k</italic>
-mers; therefore, the frequency of <italic>k</italic>
-mers in the output is 1 or 2. Hence, the <italic>k</italic>
-mers with frequency 2, which are common to both input files, are the unique signatures of database 1 against all databases.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig4"></graphic>
</fig>
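<boxed-text position="float"><p>The counting logic described in the caption of Figure 4 can be sketched in Python as follows, assuming each database has already been reduced to a list of <italic>k</italic>-mers. This is an illustration rather than the HTSFinder implementation: an in-memory collections.Counter stands in for the distributed word count, and the function name unique_signatures and the dictionary layout of the input are hypothetical.</p>
<preformat>
```python
from collections import Counter


def unique_signatures(target, databases):
    """Return the k-mers found in `target` and in no other database.

    `databases` maps a database name to its list of k-mers (assumed layout).
    """
    # Step 2: count in how many databases each k-mer occurs (1 to n).
    # set() guards against a k-mer being listed twice within one database.
    freq = Counter(kmer for kmers in databases.values() for kmer in set(kmers))
    non_repeated = [kmer for kmer, count in freq.items() if count == 1]

    # Step 3: count again over two inputs, the non-repeated list and the
    # target's own k-mer list, so every frequency is either 1 or 2.
    freq_step3 = Counter(non_repeated)
    freq_step3.update(set(databases[target]))

    # Frequency 2 means the k-mer is in both inputs: a unique signature.
    return sorted(kmer for kmer, count in freq_step3.items() if count == 2)


if __name__ == "__main__":
    toy = {
        "db1": ["AAA", "CCC", "GGG"],
        "db2": ["CCC", "TTT"],
        "db3": ["GGG", "ACG"],
        "db4": ["TTT"],
    }
    print(unique_signatures("db1", toy))
```
</preformat>
<p>A set intersection would give the same result in memory; the two-pass count is kept here because it mirrors the steps named in the caption, where each pass is a frequency count over input files.</p>
</boxed-text>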
<fig id="f5-ebo-12-2016-073" position="float"><label>Figure 5</label>
<caption><p>Top 10 bacterial genomes with the highest number of unique DNA signatures in the bacterial genome database.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig5"></graphic>
</fig>
<fig id="f6-ebo-12-2016-073" position="float"><label>Figure 6</label>
<caption><p>The ten bacterial genomes with the highest number of signatures in common with <italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
 in the bacterial genome database. This is an example of the results for common signatures with frequency 2 obtained by HTSFinder.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig6"></graphic>
</fig>
<table-wrap id="t1-ebo-12-2016-073" position="float"><label>Table 1</label>
<caption><p>A comparison of signature discovery algorithms according to the data format, computational resources, and ability to process single or multiple sequences.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">NAME</th>
<th valign="top" align="left" rowspan="1" colspan="1">DATA FORMAT</th>
<th valign="top" align="left" rowspan="1" colspan="1">ADOPTED PLATFORM ACCORDING TO THE PUBLICATION</th>
<th valign="top" align="left" rowspan="1" colspan="1">BLAST SPECIFICITY</th>
<th valign="top" align="left" rowspan="1" colspan="1">ABILITY FOR SINGLE SEQUENCE</th>
<th valign="top" align="left" rowspan="1" colspan="1">ABILITY FOR MULTIPLE SEQUENCES</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">TOFI</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">64 × 1.5 GHz Itanium 2 processors running on Linux with 64 GB of shared memory</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">TOPSI</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">98-core Linux cluster</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Insignia</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">192-node Linux cluster</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">CaSSiS</td>
<td valign="top" align="left" rowspan="1" colspan="1">rRNA</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel Core i7 CPU (4 cores, 2.67 GHz) with 24 GB of RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">CMD and PISD</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Dell PowerEdge R900 server with two Intel Xeon E7430 2.13 GHz quad-core CPUs, 12 GB RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">IMUS</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel 2.93 GHz CPU</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">PIMUS</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel Core i7 870 2.93 GHz quad-core CPU and 16 GB RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">DDCSD</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">One master node: Intel Core i7 870 CPU at 2.93 GHz and 16 GB of RAM; 10 slave nodes: Intel Core i7 3770K CPU at 3.50 GHz and 32 GB of RAM each</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
<td valign="top" align="left" rowspan="1" colspan="1">✓</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t2-ebo-12-2016-073" position="float"><label>Table 2</label>
<caption><p>An example of Hadoop WordCount results: 18-mers and their frequencies in the database.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES OR 18-MERS</th>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY IN THE DATABASE</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">8</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">25</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCA</td>
<td valign="top" align="left" rowspan="1" colspan="1">20</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">5</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCT</td>
<td valign="top" align="left" rowspan="1" colspan="1">6</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">9</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">3</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGG</td>
<td valign="top" align="left" rowspan="1" colspan="1">6</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGT</td>
<td valign="top" align="left" rowspan="1" colspan="1">38</td>
</tr>
</tbody>
</table>
</table-wrap>
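<p>Table 2 lists <italic>k</italic>-mers with their frequencies, which is the kind of output a WordCount job produces. The sketch below imitates the map and reduce steps in plain Python. It does not use Hadoop, it assumes that each <italic>k</italic>-mer arrives as a whitespace-separated word, and the names mapper, reducer, and wordcount are hypothetical.</p>

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit a (word, 1) pair for every k-mer on every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    """Reduce step: sort by key, then sum the ones emitted for each k-mer."""
    for kmer, group in groupby(sorted(pairs), key=lambda p: p[0]):
        yield kmer, sum(n for _, n in group)


def wordcount(lines):
    """Run both steps in memory and return (k-mer, frequency) rows sorted by k-mer."""
    return list(reducer(mapper(lines)))


# Toy input with 3-mers instead of 18-mers, invented for illustration only.
rows = wordcount(["AAA AAC", "AAA AAG AAA"])
for kmer, freq in rows:
    print(kmer, freq)
```

<p>Sorting before grouping mirrors the shuffle step of MapReduce and yields rows ordered by <italic>k</italic>-mer, as in Table 2.</p>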
<table-wrap id="t3-ebo-12-2016-073" position="float"><label>Table 3</label>
<caption><p>Number of signatures at the 10 lowest and the 10 highest frequencies in the bacterial genome database.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY (LEAST COMMON)</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF SIGNATURES IN THE DATABASE</th>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY (MOST COMMON)</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF 18-MERS IN THE DATABASE</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">1</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,552,866,254</td>
<td valign="top" align="left" rowspan="1" colspan="1">2040</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">2</td>
<td valign="top" align="left" rowspan="1" colspan="1">689,790,798</td>
<td valign="top" align="left" rowspan="1" colspan="1">2042</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">3</td>
<td valign="top" align="left" rowspan="1" colspan="1">245,109,794</td>
<td valign="top" align="left" rowspan="1" colspan="1">2044</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">4</td>
<td valign="top" align="left" rowspan="1" colspan="1">114,234,398</td>
<td valign="top" align="left" rowspan="1" colspan="1">2074</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">5</td>
<td valign="top" align="left" rowspan="1" colspan="1">68,395,645</td>
<td valign="top" align="left" rowspan="1" colspan="1">2075</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">6</td>
<td valign="top" align="left" rowspan="1" colspan="1">48,107,467</td>
<td valign="top" align="left" rowspan="1" colspan="1">2102</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">7</td>
<td valign="top" align="left" rowspan="1" colspan="1">31,544,271</td>
<td valign="top" align="left" rowspan="1" colspan="1">2112</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">8</td>
<td valign="top" align="left" rowspan="1" colspan="1">26,164,511</td>
<td valign="top" align="left" rowspan="1" colspan="1">2113</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">9</td>
<td valign="top" align="left" rowspan="1" colspan="1">23,650,821</td>
<td valign="top" align="left" rowspan="1" colspan="1">2114</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">10</td>
<td valign="top" align="left" rowspan="1" colspan="1">16,156,541</td>
<td valign="top" align="left" rowspan="1" colspan="1">2125</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
</tbody>
</table>
</table-wrap>
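<p>A summary like Table 3 can be derived from per-<italic>k</italic>-mer counts by tallying how many <italic>k</italic>-mers share each frequency. The following is a small sketch under that assumption; the names frequency_histogram and extremes and the toy counts are hypothetical and are not taken from the pipeline.</p>

```python
from collections import Counter


def frequency_histogram(kmer_counts):
    """Map each frequency to the number of k-mers that occur that often."""
    return Counter(kmer_counts.values())


def extremes(hist, n=10):
    """Return the n lowest and the n highest frequencies with their tallies."""
    ordered = sorted(hist.items())
    return ordered[:n], ordered[-n:]


# Toy counts, invented for illustration only.
toy_counts = {"AAA": 1, "AAC": 1, "AAG": 2, "ACA": 5}
hist = frequency_histogram(toy_counts)
least, most = extremes(hist, n=2)
print(least, most)
```

<p>With real data, the tally for frequency 1 corresponds to the number of unique signatures in the database.</p>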
<table-wrap id="t4-ebo-12-2016-073" position="float"><label>Table 4</label>
<caption><p>An example of the output for the third phase (the right side of the table). The reference numbers in this table are the numbers appended by GkmerG for easier tracking of data in the pipeline.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">SIGNATURE</th>
<th valign="top" align="left" rowspan="1" colspan="1">GkmerG REFERENCE NUMBER</th>
<th valign="top" align="left" rowspan="1" colspan="1">NAME OF THE BACTERIAL GENOME</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGATATGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1059</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Eubacterium_rectale_ATCC_33656_uid59169</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGCCACCA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1520</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Methanobacterium_SWAN_1_uid67359</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGGGAATT</td>
<td valign="top" align="left" rowspan="1" colspan="1">705</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Chromohalobacter_salexigens_DSM_3043_uid62921</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTTTTATTT</td>
<td valign="top" align="left" rowspan="1" colspan="1">472</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Campylobacter_hominis_ATCC_BAA_381_uid58981</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAAACGCCT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2649</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Tolumonas_auensis_DSM_9187_uid59395</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAAATCCGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2013</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Rahnella_Y9602_uid62715</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAATGAAGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">39</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Acinetobacter_ADP1_uid61597</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACAATAAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1337</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Lactobacillus_brevis_KB290_uid195560</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACCTTCTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACGGAAGT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2126</td>
<td valign="top" align="left" rowspan="1" colspan="1"><italic>Ruminococcus_albus_7_uid51721</italic>
</td>
</tr>
</tbody>
</table>
</table-wrap>
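<p>The lookup shown in Table 4, from a unique signature through its GkmerG reference number to a genome name, can be illustrated as follows. This is only a sketch: it assumes that every <italic>k</italic>-mer is stored together with its reference number and that a separate table maps reference numbers to names. The function annotate and its data layout are hypothetical; the sample values are taken from Table 4.</p>

```python
def annotate(unique_kmers, tagged_kmers, genome_names):
    """Attach a reference number and genome name to each unique signature."""
    wanted = set(unique_kmers)
    rows = []
    for kmer, ref in tagged_kmers:
        # Keep only k-mers that were reported as unique signatures.
        if kmer in wanted:
            rows.append((kmer, ref, genome_names[ref]))
    return sorted(rows)


# Reference numbers and names as listed in Table 4; the layout is an assumption.
names = {
    1: "Acaryochloris_marina_MBIC11017_uid58167",
    39: "Acinetobacter_ADP1_uid61597",
}
tagged = [
    ("AAAAACGCTGAATGAAGC", 39),
    ("AAAAACGCTGACCTTCTA", 1),
    ("AAAAAAAAAAAAAAAGAG", 1),  # hypothetical non-unique k-mer, filtered out
]
unique = ["AAAAACGCTGACCTTCTA", "AAAAACGCTGAATGAAGC"]
table = annotate(unique, tagged, names)
for row in table:
    print(*row)
```

<p>The third tagged <italic>k</italic>-mer is not in the list of unique signatures, so it does not appear in the annotated output.</p>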
<table-wrap id="t5-ebo-12-2016-073" position="float"><label>Table 5</label>
<caption><p><italic>B. mallei</italic>
 and <italic>B. pseudomallei</italic>
 genomes and their numbers of unique 18-mer DNA signatures in the bacterial genome database.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">THE REFERENCE NUMBER AND NAME OF THE BURKHOLDERIA GENOMES</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF UNIQUE DNA SIGNATURES</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_mallei_ATCC_23344_uid57725</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">90,278</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_mallei_NCTC_10229_uid58383</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">24,858</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_mallei_NCTC_10247_uid58385</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">19,442</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_mallei_SAVP1_uid58387</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">7,649</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_1026b_uid162511</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">282,992</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_1106a_uid58515</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">173,688</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_1710b_uid58391</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">41,153</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_668_uid58389</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">218,985</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_BPC006_uid174460</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">81,768</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_K96243_uid57733</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">195,711</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_MSHR305_uid213227</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">320,198</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_MSHR346_uid55259</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">172,551</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Burkholderia_pseudomallei_NCTC_13179_uid226109</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">382,494</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t6-ebo-12-2016-073" position="float"><label>Table 6</label>
<caption><p>A portion of the results for signatures with frequencies 2 and 3 in the database. Judging by the reference numbers, most of the common signatures are shared among phylogenetically close genomes. However, the number of common signatures among unrelated species is also notable.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES WITH FREQUENCY = 2</th>
<th colspan="2" valign="top" align="left" rowspan="1">GkmerG REFERENCE NUMBERS</th>
<th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES WITH FREQUENCY = 3</th>
<th colspan="3" valign="top" align="left" rowspan="1">GkmerG REFERENCE NUMBERS</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAGATAAATA</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">508</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAATATCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1709</td>
<td valign="top" align="left" rowspan="1" colspan="1">1708</td>
<td valign="top" align="left" rowspan="1" colspan="1">2677</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGACACAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2110</td>
<td valign="top" align="left" rowspan="1" colspan="1">2109</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAACAGAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1249</td>
<td valign="top" align="left" rowspan="1" colspan="1">1255</td>
<td valign="top" align="left" rowspan="1" colspan="1">1267</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGCATTAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2209</td>
<td valign="top" align="left" rowspan="1" colspan="1">2214</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAATAAATACA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2726</td>
<td valign="top" align="left" rowspan="1" colspan="1">2734</td>
<td valign="top" align="left" rowspan="1" colspan="1">542</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGGCTTAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">394</td>
<td valign="top" align="left" rowspan="1" colspan="1">1499</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGAAACAAAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">681</td>
<td valign="top" align="left" rowspan="1" colspan="1">678</td>
<td valign="top" align="left" rowspan="1" colspan="1">679</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACCGCCGAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1046</td>
<td valign="top" align="left" rowspan="1" colspan="1">1048</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGATGTTAAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">969</td>
<td valign="top" align="left" rowspan="1" colspan="1">2384</td>
<td valign="top" align="left" rowspan="1" colspan="1">247</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACCGCTTTTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1879</td>
<td valign="top" align="left" rowspan="1" colspan="1">1265</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGCAAAACAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2223</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">102</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACGAACAAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">101</td>
<td valign="top" align="left" rowspan="1" colspan="1">1813</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGTAAATGCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1793</td>
<td valign="top" align="left" rowspan="1" colspan="1">2731</td>
<td valign="top" align="left" rowspan="1" colspan="1">2730</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACGATTCAGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2106</td>
<td valign="top" align="left" rowspan="1" colspan="1">2107</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAGACAATG</td>
<td valign="top" align="left" rowspan="1" colspan="1">498</td>
<td valign="top" align="left" rowspan="1" colspan="1">500</td>
<td valign="top" align="left" rowspan="1" colspan="1">755</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACTAATGCTT</td>
<td valign="top" align="left" rowspan="1" colspan="1">349</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATATTCATGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">321</td>
<td valign="top" align="left" rowspan="1" colspan="1">897</td>
<td valign="top" align="left" rowspan="1" colspan="1">560</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACTAATTCTG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1406</td>
<td valign="top" align="left" rowspan="1" colspan="1">1408</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTCAAAATT</td>
<td valign="top" align="left" rowspan="1" colspan="1">567</td>
<td valign="top" align="left" rowspan="1" colspan="1">505</td>
<td valign="top" align="left" rowspan="1" colspan="1">325</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGAACCAAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">544</td>
<td valign="top" align="left" rowspan="1" colspan="1">545</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTTAGCGAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2714</td>
<td valign="top" align="left" rowspan="1" colspan="1">1814</td>
<td valign="top" align="left" rowspan="1" colspan="1">321</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGACTGACTC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2696</td>
<td valign="top" align="left" rowspan="1" colspan="1">1066</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTTTTATAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">703</td>
<td valign="top" align="left" rowspan="1" colspan="1">402</td>
<td valign="top" align="left" rowspan="1" colspan="1">324</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGATGTTGTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">545</td>
<td valign="top" align="left" rowspan="1" colspan="1">544</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAAGAAGCGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1426</td>
<td valign="top" align="left" rowspan="1" colspan="1">1427</td>
<td valign="top" align="left" rowspan="1" colspan="1">1428</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGGATTCGAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1428</td>
<td valign="top" align="left" rowspan="1" colspan="1">1427</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAATTAGCGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1128</td>
<td valign="top" align="left" rowspan="1" colspan="1">2677</td>
<td valign="top" align="left" rowspan="1" colspan="1">2404</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAAAGACTC</td>
<td valign="top" align="left" rowspan="1" colspan="1">345</td>
<td valign="top" align="left" rowspan="1" colspan="1">343</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAGATAGTGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2115</td>
<td valign="top" align="left" rowspan="1" colspan="1">1061</td>
<td valign="top" align="left" rowspan="1" colspan="1">1508</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAGTGACGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1686</td>
<td valign="top" align="left" rowspan="1" colspan="1">1693</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAGCAGCACC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2535</td>
<td valign="top" align="left" rowspan="1" colspan="1">1584</td>
<td valign="top" align="left" rowspan="1" colspan="1">1058</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t7-ebo-12-2016-073" position="float"><label>Table 7</label>
<caption><p>Number of unique DNA signatures in the human genome and in three of its chromosomes of different sequence sizes, for lengths of <italic>k</italic>
-mers from 21 to 30.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">LENGTH OF SIGNATURE</th>
<th valign="top" align="left" rowspan="1" colspan="1">THE WHOLE GENOME (2.8 GB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHR1 (222 MB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHRX (147 MB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHRY (19 MB)</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 21</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.24297e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">176,137,004</td>
<td valign="top" align="left" rowspan="1" colspan="1">109,691,126</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,221,240</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 22</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.28624e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">179,436,876</td>
<td valign="top" align="left" rowspan="1" colspan="1">112,370,062</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,550,076</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 23</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.31954e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">181,982,115</td>
<td valign="top" align="left" rowspan="1" colspan="1">114,505,744</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,825,761</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 24</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.34792e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">184,157,371</td>
<td valign="top" align="left" rowspan="1" colspan="1">116,349,017</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,070,875</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 25</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.37333e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">186,108,431</td>
<td valign="top" align="left" rowspan="1" colspan="1">117,999,580</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,294,439</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 26</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.39664e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">187,904,867</td>
<td valign="top" align="left" rowspan="1" colspan="1">119,504,725</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,501,139</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 27</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.41829e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">189,580,382</td>
<td valign="top" align="left" rowspan="1" colspan="1">120,891,039</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,693,180</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 28</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.43849e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">191,150,531</td>
<td valign="top" align="left" rowspan="1" colspan="1">122,172,724</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,872,250</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 29</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.45744e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">192,629,345</td>
<td valign="top" align="left" rowspan="1" colspan="1">123,363,397</td>
<td valign="top" align="left" rowspan="1" colspan="1">12,039,734</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>k</italic>
 = 30</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.47529e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">194,027,911</td>
<td valign="top" align="left" rowspan="1" colspan="1">124,472,828</td>
<td valign="top" align="left" rowspan="1" colspan="1">12,196,710</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t8-ebo-12-2016-073" position="float"><label>Table 8</label>
<caption><p>A comparison of five <italic>Bacillus</italic>
 strains with the highest number of unique signatures of length 18 and five others with the lowest number, both within species and in the entire database. This table shows that within-species similarity and variability have more influence on the number of signatures than the remainder of the database.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">NAME OF STRAINS</th>
<th valign="top" align="left" rowspan="1" colspan="1">WITHIN-SPECIES</th>
<th valign="top" align="left" rowspan="1" colspan="1">IN THE ENTIRE DATABASE</th>
<th valign="top" align="left" rowspan="1" colspan="1">THE ORIGINAL GENOME SIZE</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_megaterium_WSH_002_uid159841</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,860,315</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,012,591</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.0 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_infantis_NRRL_B_14911_uid222804</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,712,042</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,932,760</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.8 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_1NLA3E_uid81841</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,527,694</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,734,930</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.7 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_cellulosilyticus_DSM_2522_uid43329</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,441,938</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,688,824</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.6 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_clausii_KSM_K16_uid58237</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,177,156</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,576,848</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.2 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_subtilis_168_uid57675</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">248</td>
<td valign="top" align="left" rowspan="1" colspan="1">205</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.1 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_amyloliquefaciens_CC178_uid226115</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">247</td>
<td valign="top" align="left" rowspan="1" colspan="1">202</td>
<td valign="top" align="left" rowspan="1" colspan="1">3.8 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_anthracis_A0248_uid59385</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.4 MB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_anthracis_A2012_uid54101</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">284 KB</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1"><italic>Bacillus_anthracis_Ames_Ancestor_uid58083</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.4 MB</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t9-ebo-12-2016-073" position="float"><label>Table 9</label>
<caption><p>Number of unique DNA signatures for the forward bacterial genome database as the target and two other nontarget databases.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">DATABASES</th>
<th valign="top" align="left" rowspan="1" colspan="1">FILE SIZE</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF SIGNATURES</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">Unique signatures of the Forward bacterial genome database</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,552,866,254</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Forward + Reverse-Complement bacterial genome databases</td>
<td valign="top" align="left" rowspan="1" colspan="1">52.53 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">2,764,759,739</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Forward + Reverse-Complement bacterial genome + Human genome databases</td>
<td valign="top" align="left" rowspan="1" colspan="1">50.28 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">2,646,494,945</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t10-ebo-12-2016-073" position="float"><label>Table 10</label>
<caption><p>Computation times (in minutes) on the first and second platforms for the second and third phases of the pipeline, which find unique DNA signatures and their related species in the forward genome database.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">STEPS</th>
<th valign="top" align="left" rowspan="1" colspan="1">FILE SIZE (GB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">TIME FOR THE FIRST PLATFORM</th>
<th colspan="2" valign="top" align="left" rowspan="1">TIME FOR THE SECOND PLATFORM</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">Copying <italic>k</italic>-mers generated by GkmerG to the HDFS</td>
<td valign="top" align="left" rowspan="1" colspan="1">177.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">60</td>
<td colspan="2" valign="top" align="left" rowspan="1">63</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">WordCount process</td>
<td valign="top" align="left" rowspan="1" colspan="1">177.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">447</td>
<td colspan="2" valign="top" align="left" rowspan="1">1169</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Copying the result from HDFS to a local directory</td>
<td valign="top" align="left" rowspan="1" colspan="1">103.03</td>
<td valign="top" align="left" rowspan="1" colspan="1">34</td>
<td colspan="2" valign="top" align="left" rowspan="1">27</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Extracting unique signatures and creating tables in Hive</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">60</td>
<td colspan="2" valign="top" align="left" rowspan="1">60</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Loading unique signatures to the Hive table</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">23</td>
<td colspan="2" valign="top" align="left" rowspan="1">26</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Loading <italic>k</italic>-mers and reference numbers to the Hive table</td>
<td valign="top" align="left" rowspan="1" colspan="1">220.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">79</td>
<td colspan="2" valign="top" align="left" rowspan="1">83</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Executing the queries and copying the result to a local directory</td>
<td valign="top" align="left" rowspan="1" colspan="1">83.83</td>
<td valign="top" align="left" rowspan="1" colspan="1">1120</td>
<td colspan="2" valign="top" align="left" rowspan="1">959</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Total computational time</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">1823</td>
<td colspan="2" valign="top" align="left" rowspan="1">2387</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t11-ebo-12-2016-073" position="float"><label>Table 11</label>
<caption><p>A comparison of loading and execution times for signature frequencies 1–3 in the third phase of the pipeline.</p>
</caption>
<table frame="box" rules="rows"><thead><tr><th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY</th>
<th valign="top" align="left" rowspan="1" colspan="1">1</th>
<th valign="top" align="left" rowspan="1" colspan="1">2</th>
<th valign="top" align="left" rowspan="1" colspan="1">3</th>
</tr>
</thead>
<tbody><tr><td valign="top" align="left" rowspan="1" colspan="1">Size of the file containing signatures (GB)</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">13.1</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.66</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Time for loading the file into the Hive table (minutes)</td>
<td valign="top" align="left" rowspan="1" colspan="1">23</td>
<td valign="top" align="left" rowspan="1" colspan="1">4</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr><td valign="top" align="left" rowspan="1" colspan="1">Time for executing the queries and copying the result to a local directory (minutes)</td>
<td valign="top" align="left" rowspan="1" colspan="1">1120</td>
<td valign="top" align="left" rowspan="1" colspan="1">661</td>
<td valign="top" align="left" rowspan="1" colspan="1">557</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000090  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000090  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.