MERS Exploration Server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing

Internal identifier: 000B47 (Pmc/Checkpoint); previous: 000B46; next: 000B48

Authors: Ramin Karimi [Hungary]; Andras Hajdu [Hungary]

Source: Evolutionary Bioinformatics Online (eISSN 1176-9343), 2016, vol. 12, pp. 73-85

RBID: PMC:4750899

Abstract

Sustained efforts toward low-cost sequencing over the past few years have led to the growth of complete genome databases. In parallel with this effort, and in response to a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Because of the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding k-mer generator, GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers from the available complete genome databases to detect extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. The pipeline uses Hadoop and MapReduce as parallel and distributed computing tools on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as the bacterial genome database. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process, not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.
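
The core idea in the abstract, counting how many genomes contain each k-mer and then reading off unique and common signatures, can be illustrated with a small in-memory sketch. This is not the HTSFinder or GkmerG code, and it does not use Hadoop or MapReduce; the function names, the toy sequences, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration. Reverse complements are not considered here.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of distinct k-mers in one sequence (duplicates collapse)."""
    seq = sequence.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            found.add(kmer)
    return found


def kmer_frequencies(genomes, k):
    """For each k-mer, count the number of genomes that contain it."""
    counts = Counter()
    for sequence in genomes.values():
        counts.update(kmers(sequence, k))
    return counts


def unique_signatures(target, genomes, k):
    """k-mers of the target genome that occur in no other genome."""
    counts = kmer_frequencies(genomes, k)
    return {m for m in kmers(genomes[target], k) if counts[m] == 1}


def common_signatures(genomes, k, frequency):
    """k-mers shared by exactly `frequency` genomes."""
    counts = kmer_frequencies(genomes, k)
    return {m for m, n in counts.items() if n == frequency}


# Hypothetical toy data; real inputs would be complete genomes and k = 18.
genomes = {"a": "ACGTAC", "b": "CGTACG", "c": "TTTTAC"}
print(sorted(unique_signatures("c", genomes, 3)))  # ['TTA', 'TTT']
print(sorted(common_signatures(genomes, 3, 3)))    # ['TAC']
```

Because each genome contributes a k-mer at most once, the frequency of a k-mer ranges from 1 to the number of genomes. A frequency of 1 marks a unique signature, and higher frequencies mark signatures common to several genomes. At the scale of a bacterial genome database the counts would not fit comfortably in one process, which is where a distributed word-count style job becomes useful.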


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4750899
DOI: 10.4137/EBO.S35545
PubMed: 26884678
PubMed Central: 4750899


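Affiliations:
Ramin Karimi: Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.
Andras Hajdu: Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary; Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.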


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</title>
<author>
<name sortKey="Karimi, Ramin" sort="Karimi, Ramin" uniqKey="Karimi R" first="Ramin" last="Karimi">Ramin Karimi</name>
<affiliation wicri:level="1">
<nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Hajdu, Andras" sort="Hajdu, Andras" uniqKey="Hajdu A" first="Andras" last="Hajdu">Andras Hajdu</name>
<affiliation wicri:level="1">
<nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="af2-ebo-12-2016-073">Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Bioinformatics Research Group, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26884678</idno>
<idno type="pmc">4750899</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4750899</idno>
<idno type="RBID">PMC:4750899</idno>
<idno type="doi">10.4137/EBO.S35545</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000090</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000090</idno>
<idno type="wicri:Area/Pmc/Curation">000090</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000090</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000B47</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000B47</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</title>
<author>
<name sortKey="Karimi, Ramin" sort="Karimi, Ramin" uniqKey="Karimi R" first="Ramin" last="Karimi">Ramin Karimi</name>
<affiliation wicri:level="1">
<nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Hajdu, Andras" sort="Hajdu, Andras" uniqKey="Hajdu A" first="Andras" last="Hajdu">Andras Hajdu</name>
<affiliation wicri:level="1">
<nlm:aff id="af1-ebo-12-2016-073">Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="af2-ebo-12-2016-073">Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</nlm:aff>
<country xml:lang="fr" wicri:curation="lc">Hongrie</country>
<wicri:regionArea>Bioinformatics Research Group, University of Debrecen, Debrecen</wicri:regionArea>
<wicri:noRegion>Debrecen</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Evolutionary Bioinformatics Online</title>
<idno type="eISSN">1176-9343</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Sustained efforts toward low-cost sequencing over the past few years have led to the growth of complete genome databases. In parallel with this effort, and in response to a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Because of the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding
<italic>k</italic>
-mer generator, GkmerG (genome
<italic>k</italic>
-mers generator). Using this pipeline, we determine the frequency of
<italic>k</italic>
-mers from the available complete genome databases to detect extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. The pipeline uses Hadoop and MapReduce as parallel and distributed computing tools on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as the bacterial genome database. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process, not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.</p>
</div>
</front>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Evol Bioinform Online</journal-id>
<journal-id journal-id-type="iso-abbrev">Evol. Bioinform. Online</journal-id>
<journal-id journal-id-type="publisher-id">Evolutionary Bioinformatics</journal-id>
<journal-title-group>
<journal-title>Evolutionary Bioinformatics Online</journal-title>
</journal-title-group>
<issn pub-type="epub">1176-9343</issn>
<publisher>
<publisher-name>Libertas Academica</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26884678</article-id>
<article-id pub-id-type="pmc">4750899</article-id>
<article-id pub-id-type="doi">10.4137/EBO.S35545</article-id>
<article-id pub-id-type="publisher-id">ebo-12-2016-073</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Technical Advance</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Karimi</surname>
<given-names>Ramin</given-names>
</name>
<xref ref-type="aff" rid="af1-ebo-12-2016-073">1</xref>
<xref ref-type="corresp" rid="c1-ebo-12-2016-073"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hajdu</surname>
<given-names>Andras</given-names>
</name>
<xref ref-type="aff" rid="af1-ebo-12-2016-073">1</xref>
<xref ref-type="aff" rid="af2-ebo-12-2016-073">2</xref>
</contrib>
</contrib-group>
<aff id="af1-ebo-12-2016-073">
<label>1</label>
Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary.</aff>
<aff id="af2-ebo-12-2016-073">
<label>2</label>
Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary.</aff>
<author-notes>
<corresp id="c1-ebo-12-2016-073">CORRESPONDENCE:
<email>raminkm2000@yahoo.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>10</day>
<month>2</month>
<year>2016</year>
</pub-date>
<volume>12</volume>
<fpage>73</fpage>
<lpage>85</lpage>
<history>
<date date-type="received">
<day>28</day>
<month>9</month>
<year>2015</year>
</date>
<date date-type="rev-recd">
<day>05</day>
<month>11</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>12</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© 2016 the author(s), publisher and licensee Libertas Academica Ltd.</copyright-statement>
<copyright-year>2016</copyright-year>
<license license-type="open-access">
<license-p>This is an open access article published under the Creative Commons CC-BY-NC 3.0 license.</license-p>
</license>
</permissions>
<abstract>
<p>Sustained efforts toward low-cost sequencing over the past few years have led to the growth of complete genome databases. In parallel with this effort, and in response to a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Because of the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding
<italic>k</italic>
-mer generator, GkmerG (genome
<italic>k</italic>
-mers generator). Using this pipeline, we determine the frequency of
<italic>k</italic>
-mers from the available complete genome databases to detect extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. The pipeline uses Hadoop and MapReduce as parallel and distributed computing tools on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as the bacterial genome database. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process, not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.</p>
</abstract>
<kwd-group>
<kwd>DNA signature</kwd>
<kwd>
<italic>k</italic>
-mers</kwd>
<kwd>Hadoop</kwd>
<kwd>WordCount</kwd>
<kwd>MapReduce</kwd>
<kwd>Hive</kwd>
</kwd-group>
</article-meta>
</front>
<floats-group>
<fig id="f1-ebo-12-2016-073" position="float">
<label>Figure 1</label>
<caption>
<p>The three main phases of HTSFinder for detecting DNA signatures. We can repeat the second phase with the obtained results if required.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig1"></graphic>
</fig>
<fig id="f2-ebo-12-2016-073" position="float">
<label>Figure 2</label>
<caption>
<p>Splitting of the genome by GkmerG for
<italic>k</italic>
= 18 to obtain all possible 18-mers. Generating
<italic>k</italic>
-mers for a single genome with GkmerG involves purgation, splitting, concatenation, cleaning, sorting, and removing duplicates so that one copy of each remains. The output of GkmerG is a file containing the
<italic>k</italic>
-mers of a genome in a single column. The labels above the file numbers in this figure show the beginnings of the four
<italic>k</italic>
-mers at the head of each file.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig2"></graphic>
</fig>
<fig id="f3-ebo-12-2016-073" position="float">
<label>Figure 3</label>
<caption>
<p>An example of the overall MapReduce WordCount process. The original image was made by Trifork.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig3"></graphic>
</fig>
<fig id="f4-ebo-12-2016-073" position="float">
<label>Figure 4</label>
<caption>
<p>The recommended process for detecting unique DNA signatures of a target database against nontarget databases. In step 2 of this figure, the frequency of
<italic>k</italic>
-mers ranges from 1 to
<italic>n</italic>
, where
<italic>n</italic>
is the total number of databases used in the pipeline. Since there are four databases in this figure, the frequency of
<italic>k</italic>
-mers in step 2 ranges from 1 to 4. In step 3, there are two input files, each listing non-repeated
<italic>k</italic>
-mers; therefore, the frequency of
<italic>k</italic>
-mers in the output is 1 or 2. Hence, the
<italic>k</italic>
-mers with frequency 2, which are common to both input files, are the unique signatures of database 1 against all other databases.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig4"></graphic>
</fig>
<fig id="f5-ebo-12-2016-073" position="float">
<label>Figure 5</label>
<caption>
<p>Top 10 bacterial genomes with the highest number of unique DNA signatures in the bacterial genome database.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig5"></graphic>
</fig>
<fig id="f6-ebo-12-2016-073" position="float">
<label>Figure 6</label>
<caption>
<p>The ten bacterial genomes sharing the highest number of signatures with
<italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
in the bacterial genome database. This is an example of the results for common signatures with frequency 2 obtained by HTSFinder.</p>
</caption>
<graphic xlink:href="ebo-12-2016-073Fig6"></graphic>
</fig>
<table-wrap id="t1-ebo-12-2016-073" position="float">
<label>Table 1</label>
<caption>
<p>A comparison of signature discovery algorithms according to the data format, computational resources, and ability to process single or multiple sequences.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">NAME</th>
<th valign="top" align="left" rowspan="1" colspan="1">DATA FORMAT</th>
<th valign="top" align="left" rowspan="1" colspan="1">ADOPTED PLATFORM ACCORDING TO THE PUBLICATION</th>
<th valign="top" align="left" rowspan="1" colspan="1">BLAST SPECIFICITY</th>
<th valign="top" align="left" rowspan="1" colspan="1">ABILITY FOR SINGLE SEQUENCE</th>
<th valign="top" align="left" rowspan="1" colspan="1">ABILITY FOR MULTIPLE SEQUENCES</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">TOFI</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">64 × 1.5 GHz Itanium 2 processors running on Linux with 64 GB of shared memory</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">TOPSI</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">98-core Linux cluster</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Insignia</td>
<td valign="top" align="left" rowspan="1" colspan="1">FASTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">192-node Linux cluster</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">CaSSiS</td>
<td valign="top" align="left" rowspan="1" colspan="1">rRNA</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel Core i7 CPU (4 cores, 2.67 GHz) with 24 GB of RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">CMD and PISD</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Dell PowerEdge R900 server with two Intel Xeon E7430 2.13 GHz quad-core CPUs, 12 GB RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">IMUS</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel 2.93 GHz CPU</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">PIMUS</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Intel Core i7 870 2.93 GHz quad-core CPU and 16 GB RAM</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">DDCSD</td>
<td valign="top" align="left" rowspan="1" colspan="1">ESTs</td>
<td valign="top" align="left" rowspan="1" colspan="1">Master node: Intel Core i7 870 CPU at 2.93 GHz and 16 GB RAM; 10 slave nodes: Intel Core i7 3770K CPU at 3.50 GHz and 32 GB RAM each</td>
<td valign="top" align="left" rowspan="1" colspan="1">×</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t2-ebo-12-2016-073" position="float">
<label>Table 2</label>
<caption>
<p>An example of Hadoop and WordCount results.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES OR 18-MERS</th>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY IN THE DATABASE</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">8</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">25</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCA</td>
<td valign="top" align="left" rowspan="1" colspan="1">20</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">5</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGCT</td>
<td valign="top" align="left" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">9</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">3</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGG</td>
<td valign="top" align="left" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAAAAGGT</td>
<td valign="top" align="left" rowspan="1" colspan="1">38</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t3-ebo-12-2016-073" position="float">
<label>Table 3</label>
<caption>
<p>Number of signatures at each of the 10 lowest and 10 highest frequencies in the bacterial genome database.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY (LEAST COMMON)</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF SIGNATURES IN THE DATABASE</th>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY (MOST COMMON)</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF 18-MERS IN THE DATABASE</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,552,866,254</td>
<td valign="top" align="left" rowspan="1" colspan="1">2040</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
<td valign="top" align="left" rowspan="1" colspan="1">689,790,798</td>
<td valign="top" align="left" rowspan="1" colspan="1">2042</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">3</td>
<td valign="top" align="left" rowspan="1" colspan="1">245,109,794</td>
<td valign="top" align="left" rowspan="1" colspan="1">2044</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">4</td>
<td valign="top" align="left" rowspan="1" colspan="1">114,234,398</td>
<td valign="top" align="left" rowspan="1" colspan="1">2074</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">5</td>
<td valign="top" align="left" rowspan="1" colspan="1">68,395,645</td>
<td valign="top" align="left" rowspan="1" colspan="1">2075</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">6</td>
<td valign="top" align="left" rowspan="1" colspan="1">48,107,467</td>
<td valign="top" align="left" rowspan="1" colspan="1">2102</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">7</td>
<td valign="top" align="left" rowspan="1" colspan="1">31,544,271</td>
<td valign="top" align="left" rowspan="1" colspan="1">2112</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">8</td>
<td valign="top" align="left" rowspan="1" colspan="1">26,164,511</td>
<td valign="top" align="left" rowspan="1" colspan="1">2113</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">9</td>
<td valign="top" align="left" rowspan="1" colspan="1">23,650,821</td>
<td valign="top" align="left" rowspan="1" colspan="1">2114</td>
<td valign="top" align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">10</td>
<td valign="top" align="left" rowspan="1" colspan="1">16,156,541</td>
<td valign="top" align="left" rowspan="1" colspan="1">2125</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t4-ebo-12-2016-073" position="float">
<label>Table 4</label>
<caption>
<p>An example of the output for the third phase (the right side of the table). The reference numbers in this table are those appended by GkmerG for easier tracking of data in the pipeline.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">SIGNATURE</th>
<th valign="top" align="left" rowspan="1" colspan="1">GkmerG REFERENCE NUMBER</th>
<th valign="top" align="left" rowspan="1" colspan="1">NAME OF THE BACTERIAL GENOME</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGATATGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1059</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Eubacterium_rectale_ATCC_33656_uid59169</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGCCACCA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1520</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Methanobacterium_SWAN_1_uid67359</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTGGGAATT</td>
<td valign="top" align="left" rowspan="1" colspan="1">705</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Chromohalobacter_salexigens_DSM_3043_uid62921</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTCTTTTATTT</td>
<td valign="top" align="left" rowspan="1" colspan="1">472</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Campylobacter_hominis_ATCC_BAA_381_uid58981</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAAACGCCT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2649</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Tolumonas_auensis_DSM_9187_uid59395</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAAATCCGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2013</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Rahnella_Y9602_uid62715</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGAATGAAGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">39</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Acinetobacter_ADP1_uid61597</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACAATAAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1337</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Lactobacillus_brevis_KB290_uid195560</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACCTTCTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Acaryochloris_marina_MBIC11017_uid58167</italic>
</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAACGCTGACGGAAGT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2126</td>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Ruminococcus_albus_7_uid51721</italic>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t5-ebo-12-2016-073" position="float">
<label>Table 5</label>
<caption>
<p>
<italic>B. mallei</italic>
and
<italic>B. pseudomallei</italic>
genomes with their numbers of unique 18-mer DNA signatures in the bacterial genome database.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">THE REFERENCE NUMBER AND NAME OF THE BURKHOLDERIA GENOMES</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF UNIQUE DNA SIGNATURES</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_mallei_ATCC_23344_uid57725</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">90,278</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_mallei_NCTC_10229_uid58383</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">24,858</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_mallei_NCTC_10247_uid58385</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">19,442</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_mallei_SAVP1_uid58387</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">7,649</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_1026b_uid162511</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">282,992</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_1106a_uid58515</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">173,688</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_1710b_uid58391</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">41,153</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_668_uid58389</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">218,985</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_BPC006_uid174460</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">81,768</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_K96243_uid57733</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">195,711</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_MSHR305_uid213227</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">320,198</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_MSHR346_uid55259</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">172,551</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Burkholderia_pseudomallei_NCTC_13179_uid226109</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">382,494</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t6-ebo-12-2016-073" position="float">
<label>Table 6</label>
<caption>
<p>A portion of the results for signatures with frequencies 2 and 3 in the database. As the reference numbers show, most of the common signatures are shared among phylogenetically close genomes. However, the number of common signatures shared among unrelated species is also notable.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES WITH FREQUENCY = 2</th>
<th colspan="2" valign="top" align="left" rowspan="1">GkmerG REFERENCE NUMBERS</th>
<th valign="top" align="left" rowspan="1" colspan="1">SIGNATURES WITH FREQUENCY = 3</th>
<th colspan="3" valign="top" align="left" rowspan="1">GkmerG REFERENCE NUMBERS</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAGATAAATA</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">508</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAAATATCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1709</td>
<td valign="top" align="left" rowspan="1" colspan="1">1708</td>
<td valign="top" align="left" rowspan="1" colspan="1">2677</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGACACAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2110</td>
<td valign="top" align="left" rowspan="1" colspan="1">2109</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAAAACAGAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1249</td>
<td valign="top" align="left" rowspan="1" colspan="1">1255</td>
<td valign="top" align="left" rowspan="1" colspan="1">1267</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGCATTAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2209</td>
<td valign="top" align="left" rowspan="1" colspan="1">2214</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAATAAATACA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2726</td>
<td valign="top" align="left" rowspan="1" colspan="1">2734</td>
<td valign="top" align="left" rowspan="1" colspan="1">542</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACAGGCTTAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">394</td>
<td valign="top" align="left" rowspan="1" colspan="1">1499</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGAAACAAAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">681</td>
<td valign="top" align="left" rowspan="1" colspan="1">678</td>
<td valign="top" align="left" rowspan="1" colspan="1">679</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACCGCCGAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1046</td>
<td valign="top" align="left" rowspan="1" colspan="1">1048</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGATGTTAAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">969</td>
<td valign="top" align="left" rowspan="1" colspan="1">2384</td>
<td valign="top" align="left" rowspan="1" colspan="1">247</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACCGCTTTTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1879</td>
<td valign="top" align="left" rowspan="1" colspan="1">1265</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGCAAAACAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2223</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">102</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACGAACAAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">101</td>
<td valign="top" align="left" rowspan="1" colspan="1">1813</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGTAAATGCG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1793</td>
<td valign="top" align="left" rowspan="1" colspan="1">2731</td>
<td valign="top" align="left" rowspan="1" colspan="1">2730</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACGATTCAGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2106</td>
<td valign="top" align="left" rowspan="1" colspan="1">2107</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAGACAATG</td>
<td valign="top" align="left" rowspan="1" colspan="1">498</td>
<td valign="top" align="left" rowspan="1" colspan="1">500</td>
<td valign="top" align="left" rowspan="1" colspan="1">755</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACTAATGCTT</td>
<td valign="top" align="left" rowspan="1" colspan="1">349</td>
<td valign="top" align="left" rowspan="1" colspan="1">355</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATATTCATGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">321</td>
<td valign="top" align="left" rowspan="1" colspan="1">897</td>
<td valign="top" align="left" rowspan="1" colspan="1">560</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAACTAATTCTG</td>
<td valign="top" align="left" rowspan="1" colspan="1">1406</td>
<td valign="top" align="left" rowspan="1" colspan="1">1408</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTCAAAATT</td>
<td valign="top" align="left" rowspan="1" colspan="1">567</td>
<td valign="top" align="left" rowspan="1" colspan="1">505</td>
<td valign="top" align="left" rowspan="1" colspan="1">325</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGAACCAAAC</td>
<td valign="top" align="left" rowspan="1" colspan="1">544</td>
<td valign="top" align="left" rowspan="1" colspan="1">545</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTTAGCGAT</td>
<td valign="top" align="left" rowspan="1" colspan="1">2714</td>
<td valign="top" align="left" rowspan="1" colspan="1">1814</td>
<td valign="top" align="left" rowspan="1" colspan="1">321</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGACTGACTC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2696</td>
<td valign="top" align="left" rowspan="1" colspan="1">1066</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATTTTTATAG</td>
<td valign="top" align="left" rowspan="1" colspan="1">703</td>
<td valign="top" align="left" rowspan="1" colspan="1">402</td>
<td valign="top" align="left" rowspan="1" colspan="1">324</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGATGTTGTA</td>
<td valign="top" align="left" rowspan="1" colspan="1">545</td>
<td valign="top" align="left" rowspan="1" colspan="1">544</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAAGAAGCGC</td>
<td valign="top" align="left" rowspan="1" colspan="1">1426</td>
<td valign="top" align="left" rowspan="1" colspan="1">1427</td>
<td valign="top" align="left" rowspan="1" colspan="1">1428</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAAGGATTCGAA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1428</td>
<td valign="top" align="left" rowspan="1" colspan="1">1427</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAATTAGCGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1128</td>
<td valign="top" align="left" rowspan="1" colspan="1">2677</td>
<td valign="top" align="left" rowspan="1" colspan="1">2404</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAAAGACTC</td>
<td valign="top" align="left" rowspan="1" colspan="1">345</td>
<td valign="top" align="left" rowspan="1" colspan="1">343</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAGATAGTGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">2115</td>
<td valign="top" align="left" rowspan="1" colspan="1">1061</td>
<td valign="top" align="left" rowspan="1" colspan="1">1508</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAAATAGTGACGA</td>
<td valign="top" align="left" rowspan="1" colspan="1">1686</td>
<td valign="top" align="left" rowspan="1" colspan="1">1693</td>
<td valign="top" align="left" rowspan="1" colspan="1">AAAAAAAACAGCAGCACC</td>
<td valign="top" align="left" rowspan="1" colspan="1">2535</td>
<td valign="top" align="left" rowspan="1" colspan="1">1584</td>
<td valign="top" align="left" rowspan="1" colspan="1">1058</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t7-ebo-12-2016-073" position="float">
<label>Table 7</label>
<caption>
<p>Number of unique DNA signatures in the human genome and in three of its chromosomes of different sizes, for
<italic>k</italic>
-mer lengths from 21 to 30.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">LENGTH OF SIGNATURE</th>
<th valign="top" align="left" rowspan="1" colspan="1">THE WHOLE GENOME (2.8 GB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHR1 (222 MB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHRX (147 MB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">CHRY (19 MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 21</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.24297e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">176,137,004</td>
<td valign="top" align="left" rowspan="1" colspan="1">109,691,126</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,221,240</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 22</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.28624e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">179,436,876</td>
<td valign="top" align="left" rowspan="1" colspan="1">112,370,062</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,550,076</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 23</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.31954e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">181,982,115</td>
<td valign="top" align="left" rowspan="1" colspan="1">114,505,744</td>
<td valign="top" align="left" rowspan="1" colspan="1">10,825,761</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 24</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.34792e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">184,157,371</td>
<td valign="top" align="left" rowspan="1" colspan="1">116,349,017</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,070,875</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 25</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.37333e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">186,108,431</td>
<td valign="top" align="left" rowspan="1" colspan="1">117,999,580</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,294,439</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 26</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.39664e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">187,904,867</td>
<td valign="top" align="left" rowspan="1" colspan="1">119,504,725</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,501,139</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 27</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.41829e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">189,580,382</td>
<td valign="top" align="left" rowspan="1" colspan="1">120,891,039</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,693,180</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 28</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.43849e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">191,150,531</td>
<td valign="top" align="left" rowspan="1" colspan="1">122,172,724</td>
<td valign="top" align="left" rowspan="1" colspan="1">11,872,250</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 29</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.45744e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">192,629,345</td>
<td valign="top" align="left" rowspan="1" colspan="1">123,363,397</td>
<td valign="top" align="left" rowspan="1" colspan="1">12,039,734</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>k</italic>
= 30</td>
<td valign="top" align="left" rowspan="1" colspan="1">2.47529e+09</td>
<td valign="top" align="left" rowspan="1" colspan="1">194,027,911</td>
<td valign="top" align="left" rowspan="1" colspan="1">124,472,828</td>
<td valign="top" align="left" rowspan="1" colspan="1">12,196,710</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t8-ebo-12-2016-073" position="float">
<label>Table 8</label>
<caption>
<p>A comparison of five
<italic>Bacillus</italic>
strains with the highest number of unique signatures and five others with the lowest number of unique signatures of length 18, within species and in the entire database. This table shows that within-species similarity and variability influence the number of signatures more than the remainder of the database does.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">NAME OF STRAINS</th>
<th valign="top" align="left" rowspan="1" colspan="1">WITHIN-SPECIES</th>
<th valign="top" align="left" rowspan="1" colspan="1">IN THE ENTIRE DATABASE</th>
<th valign="top" align="left" rowspan="1" colspan="1">THE ORIGINAL GENOME SIZE</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_megaterium_WSH_002_uid159841</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,860,315</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,012,591</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.0 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_infantis_NRRL_B_14911_uid222804</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,712,042</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,932,760</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.8 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_1NLA3E_uid81841</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,527,694</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,734,930</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.7 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_cellulosilyticus_DSM_2522_uid43329</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,441,938</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,688,824</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.6 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_clausii_KSM_K16_uid58237</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">4,177,156</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,576,848</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.2 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_subtilis_168_uid57675</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">248</td>
<td valign="top" align="left" rowspan="1" colspan="1">205</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.1 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_amyloliquefaciens_CC178_uid226115</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">247</td>
<td valign="top" align="left" rowspan="1" colspan="1">202</td>
<td valign="top" align="left" rowspan="1" colspan="1">3.8 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_anthracis_A0248_uid59385</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.4 MB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_anthracis_A2012_uid54101</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">284 KB</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">
<italic>Bacillus_anthracis_Ames_Ancestor_uid58083</italic>
</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">0</td>
<td valign="top" align="left" rowspan="1" colspan="1">5.4 MB</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t9-ebo-12-2016-073" position="float">
<label>Table 9</label>
<caption>
<p>Number of unique DNA signatures for the forward bacterial genome database as the target and two other nontarget databases.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">DATABASES</th>
<th valign="top" align="left" rowspan="1" colspan="1">FILE SIZE</th>
<th valign="top" align="left" rowspan="1" colspan="1">NUMBER OF SIGNATURES</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Unique signatures of the Forward bacterial genome database</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">3,552,866,254</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Forward + Reverse-Complement bacterial genome databases</td>
<td valign="top" align="left" rowspan="1" colspan="1">52.53 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">2,764,759,739</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Forward + Reverse-Complement bacterial genome + Human genome databases</td>
<td valign="top" align="left" rowspan="1" colspan="1">50.28 GB</td>
<td valign="top" align="left" rowspan="1" colspan="1">2,646,494,945</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t10-ebo-12-2016-073" position="float">
<label>Table 10</label>
<caption>
<p>A comparison of computational results of the first and second platforms in the second and third phases of the pipeline in order to find unique DNA signatures and their related species in the forward genome database (time in minutes).</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">STEPS</th>
<th valign="top" align="left" rowspan="1" colspan="1">FILE SIZE (GB)</th>
<th valign="top" align="left" rowspan="1" colspan="1">TIME FOR THE FIRST PLATFORM (MIN)</th>
<th colspan="2" valign="top" align="left" rowspan="1">TIME FOR THE SECOND PLATFORM (MIN)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Copy
<italic>k</italic>
-mers generated by GkmerG to the HDFS</td>
<td valign="top" align="left" rowspan="1" colspan="1">177.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">60</td>
<td colspan="2" valign="top" align="left" rowspan="1">63</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">WordCount process</td>
<td valign="top" align="left" rowspan="1" colspan="1">177.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">447</td>
<td colspan="2" valign="top" align="left" rowspan="1">1169</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Copy the result from HDFS to a local directory</td>
<td valign="top" align="left" rowspan="1" colspan="1">103.03</td>
<td valign="top" align="left" rowspan="1" colspan="1">34</td>
<td colspan="2" valign="top" align="left" rowspan="1">27</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Extracting unique signatures and creating tables in Hive</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">60</td>
<td colspan="2" valign="top" align="left" rowspan="1">60</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Loading unique signatures to the Hive table</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">23</td>
<td colspan="2" valign="top" align="left" rowspan="1">26</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Loading <italic>k</italic>-mers and reference numbers to the Hive table</td>
<td valign="top" align="left" rowspan="1" colspan="1">220.35</td>
<td valign="top" align="left" rowspan="1" colspan="1">79</td>
<td colspan="2" valign="top" align="left" rowspan="1">83</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Executing the queries and copying the result to a local directory</td>
<td valign="top" align="left" rowspan="1" colspan="1">83.83</td>
<td valign="top" align="left" rowspan="1" colspan="1">1120</td>
<td colspan="2" valign="top" align="left" rowspan="1">959</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Total computational time</td>
<td valign="top" align="left" rowspan="1" colspan="1"></td>
<td valign="top" align="left" rowspan="1" colspan="1">1823</td>
<td colspan="2" valign="top" align="left" rowspan="1">2387</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="t11-ebo-12-2016-073" position="float">
<label>Table 11</label>
<caption>
<p>Comparison of loading and execution times for signatures with frequencies 1–3 in the third phase.</p>
</caption>
<table frame="box" rules="rows">
<thead>
<tr>
<th valign="top" align="left" rowspan="1" colspan="1">FREQUENCY</th>
<th valign="top" align="left" rowspan="1" colspan="1">1</th>
<th valign="top" align="left" rowspan="1" colspan="1">2</th>
<th valign="top" align="left" rowspan="1" colspan="1">3</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Size of the file containing signatures (GB)</td>
<td valign="top" align="left" rowspan="1" colspan="1">67.5</td>
<td valign="top" align="left" rowspan="1" colspan="1">13.1</td>
<td valign="top" align="left" rowspan="1" colspan="1">4.66</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Time for loading the file into the Hive table (minutes)</td>
<td valign="top" align="left" rowspan="1" colspan="1">23</td>
<td valign="top" align="left" rowspan="1" colspan="1">4</td>
<td valign="top" align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="1" colspan="1">Time for executing the queries and copying the result to a local directory (minutes)</td>
<td valign="top" align="left" rowspan="1" colspan="1">1120</td>
<td valign="top" align="left" rowspan="1" colspan="1">661</td>
<td valign="top" align="left" rowspan="1" colspan="1">557</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
<affiliations>
<list>
<country>
<li>Hongrie</li>
</country>
</list>
<tree>
<country name="Hongrie">
<noRegion>
<name sortKey="Karimi, Ramin" sort="Karimi, Ramin" uniqKey="Karimi R" first="Ramin" last="Karimi">Ramin Karimi</name>
</noRegion>
<name sortKey="Hajdu, Andras" sort="Hajdu, Andras" uniqKey="Hajdu A" first="Andras" last="Hajdu">Andras Hajdu</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B47 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd -nk 000B47 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Checkpoint
   |type=    RBID
   |clé=     PMC:4750899
   |texte=   HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/RBID.i   -Sk "pubmed:26884678" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021