StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees

Internal identifier: 001118 (Pmc/Corpus); previous: 001117; next: 001119

Authors: Märt Roosaare; Mihkel Vaher; Lauris Kaplinski; Märt Möls; Reidar Andreson; Maarja Lepamets; Triinu Kõressaar; Paul Naaber; Siiri Kõljalg; Maido Remm

Source:

PeerJ [2167-8359]; 2017.

RBID : PMC:5438578

Abstract

Background

Fast, accurate and high-throughput identification of bacterial isolates is in great demand. The present work was conducted to investigate the possibility of identifying isolates from unassembled next-generation sequencing reads using custom-made guide trees.

Results

A tool named StrainSeeker was developed that constructs a list of specific k-mers for each node of any given Newick-format tree and enables the identification of bacterial isolates in 1–2 min. It uses a novel algorithm, which analyses the observed and expected fractions of node-specific k-mers to test the presence of each node in the sample. This allows StrainSeeker to determine where the isolate branches off the guide tree and assign it to a clade whereas other tools assign each read to a reference genome. Using a dataset of 100 Escherichia coli isolates, we demonstrate that StrainSeeker can predict the clades of E. coli with 92% accuracy and correct tree branch assignment with 98% accuracy. Twenty-five thousand Illumina HiSeq reads are sufficient for identification of the strain.

Conclusion

StrainSeeker is a software program that identifies bacterial isolates by assigning them to nodes or leaves of a custom-made guide tree. StrainSeeker’s web interface and pre-computed guide trees are available at http://bioinfo.ut.ee/strainseeker. Source code is stored at GitHub: https://github.com/bioinfo-ut/StrainSeeker.
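
To make the idea of node-specific k-mers concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not StrainSeeker's implementation: the abstract does not define "specific" precisely, so the sketch assumes a k-mer is specific to a node when it occurs in every genome below that node and in no genome outside it. The guide tree is written as nested tuples rather than parsed from Newick, reverse complements are ignored, and all names (`node_specific_kmers`, `observed_fraction`, the toy genomes) are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees (stand-in for Newick).
Tree = Union[str, Tuple["Tree", ...]]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leaves(tree: Tree) -> Set[str]:
    """Return the leaf names below a tree node."""
    if isinstance(tree, str):
        return {tree}
    names: Set[str] = set()
    for child in tree:
        names |= leaves(child)
    return names


def node_specific_kmers(
    tree: Tree, genomes: Dict[str, str], k: int
) -> Dict[FrozenSet[str], Set[str]]:
    """Map each node (keyed by its leaf set) to its specific k-mers.

    Assumed definition: present in all genomes under the node,
    absent from all genomes outside it.
    """
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    all_names = set(genomes)
    result: Dict[FrozenSet[str], Set[str]] = {}

    def visit(node: Tree) -> None:
        inside = leaves(node)
        shared = set.intersection(*(per_genome[n] for n in inside))
        for name in all_names - inside:
            shared -= per_genome[name]
        result[frozenset(inside)] = shared
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return result


def observed_fraction(node_kmers: Set[str], reads: Iterable[str], k: int) -> float:
    """Fraction of a node's specific k-mers that occur in the reads."""
    if not node_kmers:
        return 0.0
    seen: Set[str] = set()
    for read in reads:
        seen |= kmers(read, k)
    return len(node_kmers & seen) / len(node_kmers)


# Toy example: three made-up genomes and a guide tree ((A,B),C).
genomes = {"A": "AAAACCCC", "B": "AAAAGGGG", "C": "TTTTGGGG"}
tree = (("A", "B"), "C")
specific = node_specific_kmers(tree, genomes, k=4)
reads = ["AAAACC", "ACCCC"]
for node, ks in specific.items():
    print(sorted(node), observed_fraction(ks, reads, k=4))
```

In the tool described above, the observed fraction is compared with an expected fraction to test whether each node is present in the sample. That statistical test is not specified in the abstract and is deliberately left out of this sketch.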

URL:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5438578

DOI: 10.7717/peerj.3353
PubMed: 28533988
PubMed Central: 5438578

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees</title>
<author><name sortKey="Roosaare, M Rt" sort="Roosaare, M Rt" uniqKey="Roosaare M" first="M Rt" last="Roosaare">Märt Roosaare</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Vaher, Mihkel" sort="Vaher, Mihkel" uniqKey="Vaher M" first="Mihkel" last="Vaher">Mihkel Vaher</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kaplinski, Lauris" sort="Kaplinski, Lauris" uniqKey="Kaplinski L" first="Lauris" last="Kaplinski">Lauris Kaplinski</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mols, M Rt" sort="Mols, M Rt" uniqKey="Mols M" first="M Rt" last="Möls">Märt Möls</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-2"><institution>Institute of Mathematical Statistics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Andreson, Reidar" sort="Andreson, Reidar" uniqKey="Andreson R" first="Reidar" last="Andreson">Reidar Andreson</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lepamets, Maarja" sort="Lepamets, Maarja" uniqKey="Lepamets M" first="Maarja" last="Lepamets">Maarja Lepamets</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="K Ressaar, Triinu" sort="K Ressaar, Triinu" uniqKey="K Ressaar T" first="Triinu" last="K Ressaar">Triinu Kõressaar</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Naaber, Paul" sort="Naaber, Paul" uniqKey="Naaber P" first="Paul" last="Naaber">Paul Naaber</name>
<affiliation><nlm:aff id="aff-3"><institution>Synlab Eesti</institution>
,<addr-line>Tallinn</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-4"><institution>Department of Microbiology, Institute of Biomedicine and Translational Medicine, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="K Ljalg, Siiri" sort="K Ljalg, Siiri" uniqKey="K Ljalg S" first="Siiri" last="K Ljalg">Siiri Kõljalg</name>
<affiliation><nlm:aff id="aff-4"><institution>Department of Microbiology, Institute of Biomedicine and Translational Medicine, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-5"><institution>United Laboratories, Tartu University Clinics</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remm, Maido" sort="Remm, Maido" uniqKey="Remm M" first="Maido" last="Remm">Maido Remm</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">28533988</idno>
<idno type="pmc">5438578</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5438578</idno>
<idno type="RBID">PMC:5438578</idno>
<idno type="doi">10.7717/peerj.3353</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">001118</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001118</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees</title>
<author><name sortKey="Roosaare, M Rt" sort="Roosaare, M Rt" uniqKey="Roosaare M" first="M Rt" last="Roosaare">Märt Roosaare</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Vaher, Mihkel" sort="Vaher, Mihkel" uniqKey="Vaher M" first="Mihkel" last="Vaher">Mihkel Vaher</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kaplinski, Lauris" sort="Kaplinski, Lauris" uniqKey="Kaplinski L" first="Lauris" last="Kaplinski">Lauris Kaplinski</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mols, M Rt" sort="Mols, M Rt" uniqKey="Mols M" first="M Rt" last="Möls">Märt Möls</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-2"><institution>Institute of Mathematical Statistics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Andreson, Reidar" sort="Andreson, Reidar" uniqKey="Andreson R" first="Reidar" last="Andreson">Reidar Andreson</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lepamets, Maarja" sort="Lepamets, Maarja" uniqKey="Lepamets M" first="Maarja" last="Lepamets">Maarja Lepamets</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="K Ressaar, Triinu" sort="K Ressaar, Triinu" uniqKey="K Ressaar T" first="Triinu" last="K Ressaar">Triinu Kõressaar</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Naaber, Paul" sort="Naaber, Paul" uniqKey="Naaber P" first="Paul" last="Naaber">Paul Naaber</name>
<affiliation><nlm:aff id="aff-3"><institution>Synlab Eesti</institution>
,<addr-line>Tallinn</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-4"><institution>Department of Microbiology, Institute of Biomedicine and Translational Medicine, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="K Ljalg, Siiri" sort="K Ljalg, Siiri" uniqKey="K Ljalg S" first="Siiri" last="K Ljalg">Siiri Kõljalg</name>
<affiliation><nlm:aff id="aff-4"><institution>Department of Microbiology, Institute of Biomedicine and Translational Medicine, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff-5"><institution>United Laboratories, Tartu University Clinics</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remm, Maido" sort="Remm, Maido" uniqKey="Remm M" first="Maido" last="Remm">Maido Remm</name>
<affiliation><nlm:aff id="aff-1"><institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PeerJ</title>
<idno type="eISSN">2167-8359</idno>
<imprint><date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Fast, accurate and high-throughput identification of bacterial isolates is in great demand. The present work was conducted to investigate the possibility of identifying isolates from unassembled next-generation sequencing reads using custom-made guide trees.</p>
</sec>
<sec><title>Results</title>
<p>A tool named StrainSeeker was developed that constructs a list of specific <italic>k</italic>
-mers for each node of any given Newick-format tree and enables the identification of bacterial isolates in 1–2 min. It uses a novel algorithm, which analyses the observed and expected fractions of node-specific <italic>k</italic>
-mers to test the presence of each node in the sample. This allows StrainSeeker to determine where the isolate branches off the guide tree and assign it to a clade whereas other tools assign each read to a reference genome. Using a dataset of 100 <italic>Escherichia coli</italic>
 isolates, we demonstrate that StrainSeeker can predict the clades of <italic>E. coli</italic>
 with 92% accuracy and correct tree branch assignment with 98% accuracy. Twenty-five thousand Illumina HiSeq reads are sufficient for identification of the strain.</p>
</sec>
<sec><title>Conclusion</title>
<p>StrainSeeker is a software program that identifies bacterial isolates by assigning them to nodes or leaves of a custom-made guide tree. StrainSeeker’s web interface and pre-computed guide trees are available at <uri xlink:href="http://bioinfo.ut.ee/strainseeker">http://bioinfo.ut.ee/strainseeker</uri>
. Source code is stored at GitHub: <uri xlink:href="https://github.com/bioinfo-ut/StrainSeeker">https://github.com/bioinfo-ut/StrainSeeker</uri>
.</p>
</sec>
</div>
</front>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PeerJ</journal-id>
<journal-id journal-id-type="iso-abbrev">PeerJ</journal-id>
<journal-id journal-id-type="pmc">PeerJ</journal-id>
<journal-id journal-id-type="publisher-id">PeerJ</journal-id>
<journal-title-group><journal-title>PeerJ</journal-title>
</journal-title-group>
<issn pub-type="epub">2167-8359</issn>
<publisher><publisher-name>PeerJ Inc.</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">28533988</article-id>
<article-id pub-id-type="pmc">5438578</article-id>
<article-id pub-id-type="publisher-id">3353</article-id>
<article-id pub-id-type="doi">10.7717/peerj.3353</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Bioinformatics</subject>
</subj-group>
<subj-group subj-group-type="heading"><subject>Computational Biology</subject>
</subj-group>
<subj-group subj-group-type="heading"><subject>Microbiology</subject>
</subj-group>
</article-categories>
<title-group><article-title>StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees</article-title>
</title-group>
<contrib-group><contrib id="author-1" contrib-type="author" corresp="yes"><name><surname>Roosaare</surname>
<given-names>Märt</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
<email>mrt.roos@gmail.com</email>
</contrib>
<contrib id="author-2" contrib-type="author"><name><surname>Vaher</surname>
<given-names>Mihkel</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-3" contrib-type="author"><name><surname>Kaplinski</surname>
<given-names>Lauris</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-4" contrib-type="author"><name><surname>Möls</surname>
<given-names>Märt</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<contrib id="author-5" contrib-type="author"><name><surname>Andreson</surname>
<given-names>Reidar</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-6" contrib-type="author"><name><surname>Lepamets</surname>
<given-names>Maarja</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-7" contrib-type="author"><name><surname>Kõressaar</surname>
<given-names>Triinu</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-8" contrib-type="author"><name><surname>Naaber</surname>
<given-names>Paul</given-names>
</name>
<xref ref-type="aff" rid="aff-3">3</xref>
<xref ref-type="aff" rid="aff-4">4</xref>
</contrib>
<contrib id="author-9" contrib-type="author"><name><surname>Kõljalg</surname>
<given-names>Siiri</given-names>
</name>
<xref ref-type="aff" rid="aff-4">4</xref>
<xref ref-type="aff" rid="aff-5">5</xref>
</contrib>
<contrib id="author-10" contrib-type="author"><name><surname>Remm</surname>
<given-names>Maido</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<aff id="aff-1"><label>1</label>
<institution>Department of Bioinformatics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</aff>
<aff id="aff-2"><label>2</label>
<institution>Institute of Mathematical Statistics, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</aff>
<aff id="aff-3"><label>3</label>
<institution>Synlab Eesti</institution>
,<addr-line>Tallinn</addr-line>
,<country>Estonia</country>
</aff>
<aff id="aff-4"><label>4</label>
<institution>Department of Microbiology, Institute of Biomedicine and Translational Medicine, University of Tartu</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</aff>
<aff id="aff-5"><label>5</label>
<institution>United Laboratories, Tartu University Clinics</institution>
,<addr-line>Tartu</addr-line>
,<country>Estonia</country>
</aff>
</contrib-group>
<contrib-group><contrib contrib-type="editor"><name><surname>Lazo</surname>
<given-names>Gerard</given-names>
</name>
</contrib>
</contrib-group>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2017-05-18"><day>18</day>
<month>5</month>
<year iso-8601-date="2017">2017</year>
</pub-date>
<pub-date pub-type="collection"><year>2017</year>
</pub-date>
<volume>5</volume>
<elocation-id>e3353</elocation-id>
<history><date date-type="received" iso-8601-date="2017-02-24"><day>24</day>
<month>2</month>
<year iso-8601-date="2017">2017</year>
</date>
<date date-type="accepted" iso-8601-date="2017-04-26"><day>26</day>
<month>4</month>
<year iso-8601-date="2017">2017</year>
</date>
</history>
<permissions><copyright-statement>© 2017 Roosaare et al.</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Roosaare et al.</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://peerj.com/articles/3353"></self-uri>
<abstract><sec><title>Background</title>
<p>Fast, accurate and high-throughput identification of bacterial isolates is in great demand. The present work was conducted to investigate the possibility of identifying isolates from unassembled next-generation sequencing reads using custom-made guide trees.</p>
</sec>
<sec><title>Results</title>
<p>A tool named StrainSeeker was developed that constructs a list of specific <italic>k</italic>
-mers for each node of any given Newick-format tree and enables the identification of bacterial isolates in 1–2 min. It uses a novel algorithm, which analyses the observed and expected fractions of node-specific <italic>k</italic>
-mers to test the presence of each node in the sample. This allows StrainSeeker to determine where the isolate branches off the guide tree and assign it to a clade whereas other tools assign each read to a reference genome. Using a dataset of 100 <italic>Escherichia coli</italic>
 isolates, we demonstrate that StrainSeeker can predict the clades of <italic>E. coli</italic>
 with 92% accuracy and correct tree branch assignment with 98% accuracy. Twenty-five thousand Illumina HiSeq reads are sufficient for identification of the strain.</p>
</sec>
<sec><title>Conclusion</title>
<p>StrainSeeker is a software program that identifies bacterial isolates by assigning them to nodes or leaves of a custom-made guide tree. StrainSeeker’s web interface and pre-computed guide trees are available at <uri xlink:href="http://bioinfo.ut.ee/strainseeker">http://bioinfo.ut.ee/strainseeker</uri>
. Source code is stored at GitHub: <uri xlink:href="https://github.com/bioinfo-ut/StrainSeeker">https://github.com/bioinfo-ut/StrainSeeker</uri>
.</p>
</sec>
</abstract>
<kwd-group kwd-group-type="author"><kwd><italic>k</italic>
-mer</kwd>
<kwd>Clade</kwd>
<kwd>Strain identification</kwd>
<kwd>Species identification</kwd>
<kwd>Diagnostics</kwd>
</kwd-group>
<funding-group><award-group id="fund-1"><funding-source>European Union through the European Regional Development Fund through Estonian Centre of Excellence in Genomics and Translational Medicine</funding-source>
<award-id>2014-2020.4.01.15-0012, 3.2.0701.11-0013</award-id>
</award-group>
<award-group id="fund-2"><funding-source>Estonian Ministry of Education and Research</funding-source>
<award-id>IUT34-11, SF0180132s08, KOGU-HUMB</award-id>
</award-group>
<award-group id="fund-3"><funding-source>Baltic Antibiotic Resistance collaborative Network (BARN)</funding-source>
</award-group>
<award-group id="fund-4"><funding-source>Estonian Research Council</funding-source>
<award-id>IUT34-19</award-id>
</award-group>
<award-group id="fund-5"><funding-source>Estonian Science Foundation</funding-source>
<award-id>9059</award-id>
</award-group>
<funding-statement>This work was supported by the European Union through the European Regional Development Fund through Estonian Centre of Excellence in Genomics and Translational Medicine (project No. 2014-2020.4.01.15-0012) and project ARMMD (No. 3.2.0701.11-0013), by the Estonian Ministry of Education and Research (institutional grant IUT34-11, target financing grants SF0180132s08 and KOGU-HUMB), by the Baltic Antibiotic Resistance collaborative Network (BARN), by the Estonian Research Council (grant No. IUT34-19) and by the Estonian Science Foundation (grant No. 9059). There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.</funding-statement>
</funding-group>
</article-meta>
</front>
<body><sec sec-type="intro"><title>Introduction</title>
<p>Pathogenic bacteria represent a considerable danger for human health worldwide. For effective outbreak detection and epidemiological surveillance, bacterial pathogens must be rapidly identified. For this, the pathogen is usually isolated and typed with various molecular methods, most of which are based on polymerase chain reaction or, in the last few years, whole-genome sequencing (WGS) (<xref rid="ref-5" ref-type="bibr">Inouye et al., 2014</xref>
; <xref rid="ref-4" ref-type="bibr">Hasman et al., 2014</xref>
). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry has also been used to quickly and cheaply identify bacterial colonies (<xref rid="ref-7" ref-type="bibr">Karamonová et al., 2013</xref>
), but for strain-level identification it requires very precise, manually crafted databases for each species which, to a large extent, are not available today.</p>
<p>One of the main goals of molecular typing is classification of pathogens into clonal groups (<xref rid="ref-5" ref-type="bibr">Inouye et al., 2014</xref>
). This is important because strains from the same species can have vastly different effects on their host. A well-known example is <italic>Escherichia coli</italic>
, a species which contains some strains such as <italic>E. coli</italic>
 O157:H7 (<xref rid="ref-21" ref-type="bibr">Tu, He & Zhou, 2014</xref>
) and <italic>E. coli</italic>
 EC958 (<xref rid="ref-17" ref-type="bibr">Petty et al., 2014</xref>
) that are considerably more virulent than others. For classifying isolates, multi-locus sequence typing (MLST) (<xref rid="ref-12" ref-type="bibr">Maiden, 2006</xref>
) or clone-specific markers have been used (<xref rid="ref-5" ref-type="bibr">Inouye et al., 2014</xref>
). Several approaches have been developed that can detect clinically relevant mutations and alleles directly from WGS reads, such as KvarQ (<xref rid="ref-19" ref-type="bibr">Steiner et al., 2014</xref>
), Mykrobe (<xref rid="ref-3" ref-type="bibr">Bradley et al., 2015</xref>
) and SRST2 (<xref rid="ref-5" ref-type="bibr">Inouye et al., 2014</xref>
). However, in most cases deep sequencing coverage and highly specialized allele databases are required (e.g., Mykrobe can be used only for <italic>Mycobacterium tuberculosis</italic>
 and <italic>Staphylococcus aureus</italic>
 identification), making the use of such programs complicated for the identification of isolates. The Reads2Type web service (<xref rid="ref-18" ref-type="bibr">Saputra et al., 2015</xref>
) can be used for the rapid taxonomical identification of any bacterial isolate, but only at the species level. To classify an isolate to a clonal group, higher resolution is necessary.</p>
<p>Instead of looking for a set of clone-specific markers, full bacterial genomes could be used as the reference sequences. Bacterial identification programs based on the detection of short DNA oligomers with length <italic>k</italic>
 (<italic>k</italic>
-mers) such as Kraken (<xref rid="ref-22" ref-type="bibr">Wood & Salzberg, 2014</xref>
) or CLARK (<xref rid="ref-15" ref-type="bibr">Ounit et al., 2015</xref>
) can use the whole RefSeq bacterial genomes database and identify isolates with high accuracy (<xref rid="ref-18" ref-type="bibr">Saputra et al., 2015</xref>
). Moreover, they can handle low-coverage WGS samples as well, because they classify each read separately. Compared to the alignment-based tools like Sigma (<xref rid="ref-1" ref-type="bibr">Ahn, Chai & Pan, 2015</xref>
), <italic>k</italic>
-mer based programs have been shown to be superior, especially when considering running time (<xref rid="ref-11" ref-type="bibr">Lindgreen, Adair & Gardner, 2016</xref>
; <xref rid="ref-16" ref-type="bibr">Peabody et al., 2015</xref>
). Kraken identifies each of the sequence reads separately using the National Center for Biotechnology Information (NCBI) taxonomy tree, counting the hits to each of the taxa on the tree and finding the branch with the most total hits. CLARK also identifies each of the reads, but instead of using a tree, it is based on a non-hierarchical, user-defined database.</p>
<p>We present StrainSeeker, a program for quick classification of bacterial isolates into clonal groups or clades directly from raw WGS reads. StrainSeeker uses a guide tree to approximate phylogenetic relationships between reference bacterial genomes down to the strain level, not being tied to existing taxonomic systems such as the NCBI taxonomy. This helps to avoid controversies such as the case of <italic>E. coli</italic>
 and <italic>Shigella</italic>
 sp., in which <italic>Shigella</italic>
 strains have been shown to be phylogenetically very similar to <italic>E. coli</italic>
, but belong to different species according to NCBI taxonomy (<xref rid="ref-9" ref-type="bibr">Lan & Reeves, 2002</xref>
). The guide tree has to be provided by the user. We developed a novel algorithm that assigns the isolate to a specific clade on the guide tree, based on the number of shared <italic>k</italic>
-mers on different taxonomic levels. Instead of read counts assigned to individual reference genomes, StrainSeeker results are given as a single strain or a clade consisting of multiple strains, along with a visual representation of the guide tree showing where the isolate branches off.</p>
<sec><title>Implementation</title>
<p>StrainSeeker is designed to analyze raw WGS sequencing reads and quickly determine the clade of the isolate in the user-provided guide tree. Before StrainSeeker can be used to identify bacteria, the database of specific <italic>k</italic>
-mers needs to be built or a pre-built one downloaded. To create a database, the user needs to provide a set of high-quality assembled bacterial strain genomes and a guide tree describing the approximate phylogeny of provided strains (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>
). Any Newick-format tree can be used as the guide tree. The database is built according to the guide tree structure, starting from the leaves (individual strains) and moving toward the root. All operations with <italic>k</italic>
-mers are done using the GenomeTester4 software (<xref rid="ref-6" ref-type="bibr">Kaplinski, Lepamets & Remm, 2015</xref>
). To reduce the noise in samples that may be caused by the DNA of other, non-bacterial organisms such as human DNA in clinical samples, the user can also provide a list of potential contaminating sequences (the “blacklist”). In the database building process, all strain <italic>k</italic>
-mers that are present in the blacklist are eliminated. The blacklist itself is not part of the database. The final database contains specific <italic>k</italic>
-mers for each node and leaf (strain) represented in the guide tree and an index file containing the database structure and <italic>k</italic>
-mer counts. The database has to be built only once, not for every identification.</p>
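<p>The bottom-up construction described above can be sketched with plain Python sets. This is only an illustration under stated assumptions, not StrainSeeker's actual implementation (which operates on GenomeTester4 <italic>k</italic>-mer lists): the guide tree is assumed to be binary and written as nested tuples, and the function name build_database and its arguments are hypothetical.</p>

```python
from collections import Counter


def build_database(tree, genome_kmers, blacklist=frozenset()):
    """Return a dict mapping every node and leaf to its specific k-mers.

    tree         -- nested 2-tuples with strain names (str) as leaves (assumption)
    genome_kmers -- dict: strain name -> iterable of k-mers from its assembly
    blacklist    -- k-mers from potential contaminants, removed up front
    """
    db = {}

    def visit(node):
        if isinstance(node, str):
            # Leaf: start from all k-mers of the strain, minus the blacklist.
            db[node] = set(genome_kmers[node]).difference(blacklist)
            return
        left, right = node
        visit(left)
        visit(right)
        # k-mers shared by both children move up to the parent node.
        shared = db[left].intersection(db[right])
        db[left].difference_update(shared)
        db[right].difference_update(shared)
        db[node] = shared

    visit(tree)

    # Final step: drop k-mers that still occur in more than one node.
    counts = Counter(kmer for kmers in db.values() for kmer in kmers)
    return {node: {kmer for kmer in kmers if counts[kmer] == 1}
            for node, kmers in db.items()}
```

<p>In this sketch a <italic>k</italic>-mer shared by two strains in different clades, but absent from their common ancestor's other descendants, ends up in no list at all, which mirrors the removal of non-specific <italic>k</italic>-mers described for the final building step.</p>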
<fig id="fig-1" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-1</object-id>
<label>Figure 1</label>
<caption><title>StrainSeeker database building process.</title>
<p>Database construction requires high-quality assembled bacterial genomes as input. Next, the user has to build a guide tree that contains all the input strains. Before the building process, the assembled genome of each strain is converted into a <italic>k</italic>
-mer list. The building process starts from the strain level and moves shared <italic>k</italic>
-mers toward the root. The final step is to eliminate non-specific <italic>k</italic>
-mers that occur in the “blacklist” or in any other nodes. The finished database contains <italic>k</italic>
-mer lists specific to each node and strain and can be used to quickly identify any strain included on the guide tree and the strains related to them.</p>
</caption>
<graphic xlink:href="peerj-05-3353-g001"></graphic>
</fig>
<p>The search process follows the same guide tree structure. The search is recursive, starting the analysis of node-specific <italic>k</italic>
-mers at the root node of the tree and moving down toward the potential newly characterized strains (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>
). Depending on where the isolate branches off the guide tree, the result is given as a single strain or a clade (<xref ref-type="fig" rid="fig-3">Fig. 3</xref>
). In case of multiple strains present in the sample, all are reported with their respective fractions, which helps to detect contamination in the sample. StrainSeeker is implemented in Perl and can be run either as a stand-alone program on a UNIX server or as a web service. The output format of StrainSeeker is either a text file or a visualized result (<xref ref-type="supplementary-material" rid="supp-1">Fig. S1A</xref>
).</p>
<fig id="fig-2" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-2</object-id>
<label>Figure 2</label>
<caption><title>Strain identification process.</title>
<p>After the sample is sequenced, the reads are converted into <italic>k</italic>
-mers. Sample <italic>k</italic>
-mers that are also present in the database (marked blue) are counted in their respective nodes (mapping to the tree). The search starts from the root node (N1) and recursively moves down to the subnodes. First, the fraction of observed <italic>k</italic>
-mers <italic>O</italic>
 is calculated (for details, see Methods). For the identification process to continue along the current branch, <italic>O</italic>
 is required to exceed a cutoff calculated for each node, otherwise the process stops (N2). Then, an observed/expected value <italic>O</italic>
/<italic>E</italic>
 is calculated (see Methods) for nodes or the strain is shown as the result for the leaves. The process continues to the subnodes if <italic>O</italic>
/<italic>E</italic>
 does not significantly differ from 1, showing that the current node is present in the sample and there is no branching to unknown nodes and strains. <italic>O</italic>
/<italic>E</italic>
 values that are significantly higher than 1 (N4) indicate the presence of an unknown strain that belongs to the clade N4. <italic>O</italic>
/<italic>E</italic>
 values that are significantly lower than 1 (N5) indicate possible sequencing errors and the process stops. The result above shows that a strain was found that belongs to the clade N4, which contains strains S3 and S4.</p>
</caption>
<graphic xlink:href="peerj-05-3353-g002"></graphic>
</fig>
<fig id="fig-3" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-3</object-id>
<label>Figure 3</label>
<caption><title>A visualized StrainSeeker result.</title>
<p>The identified strain is marked with a red arrow and box. The percentage (100%) indicates that only this strain was found in the sample. The green line marks the path of StrainSeeker’s identification process, and Roman numerals mark the successive nodes that were present in the sample according to the algorithm. <italic>N</italic>
 indicates the number of <italic>k</italic>
-mers specific to this node that were found in the sample, <italic>N</italic>
<sub>max</sub>
 indicates the maximum number of <italic>k</italic>
-mers specific to the node. Each strain name is given as follows: [Multi-locus sequence type] [Strain name] [RefSeq identifier] [NCBI accession number]. The tree shown is a branch of the “Mash-based guide tree” (see Methods). The isolate in the example forms a small clade with the <italic>Escherichia coli</italic>
 strain JJ1886 and is marked “NOVEL,” indicating that it is a strain closely related to JJ1886, but not this exact strain. This is determined in step VI according to our algorithm (see Methods).</p>
</caption>
<graphic xlink:href="peerj-05-3353-g003"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="materials|methods"><title>Materials and Methods</title>
<sec><title>Building the guide trees</title>
<p>Four trees were built in total—three were used as guide trees for database building, one was used as a reference to determine the clades of 100 <italic>E. coli</italic>
 isolates used in testing.</p>
<p>Two <italic>E. coli</italic>
 multiple gene alignment-based trees were built. First, the “gene alignment-based guide tree” contained 74 <italic>E. coli</italic>
 strains from the NCBI RefSeq database (release 69). Second, the “gene alignment-based reference tree” contained the same 74 strains from RefSeq and also 100 <italic>E. coli</italic>
 isolates that were used to test the performance of StrainSeeker (<xref ref-type="supplementary-material" rid="supp-3">Table S1</xref>
). We defined a clade by the phylogenetic distance—all the strains separated by less than 0.001 nucleotide substitutions per site were considered a clade (<xref ref-type="fig" rid="fig-4">Fig. 4A</xref>
). As the true strain-level identity of the isolates was not known, we used the “gene alignment-based reference tree” to determine the clades of the 100 isolates (<xref ref-type="supplementary-material" rid="supp-2">Fig. S2</xref>
). Similar phylogenetic trees have been used before for <italic>E. coli</italic>
 phylogenetic analysis (<xref rid="ref-13" ref-type="bibr">Ogura et al., 2009</xref>
). Multiple alignments for both trees were built in a similar fashion. We extracted all <italic>E. coli</italic>
 genomic proteins from the UniProtKB/Swiss-Prot database (accessed 6/10/2016) and used TBLASTN 2.2.30 (<xref rid="ref-2" ref-type="bibr">Altschul et al., 1997</xref>
) (match identity ≥90%, match coverage ≥95%) to check which proteins were present in each of the 174 <italic>E. coli</italic>
 genomes. The nucleotide sequences of 126 genes shared between all these strains (<xref ref-type="supplementary-material" rid="supp-4">Table S2</xref>
) were concatenated and a multiple alignment was built with MAFFT v7.305b (parameters: <italic>--maxiterate 1000</italic>
) (<xref rid="ref-8" ref-type="bibr">Katoh et al., 2002</xref>
). Trees were built with MEGA (<xref rid="ref-20" ref-type="bibr">Tamura et al., 2013</xref>
), using the neighbor-joining method and 500 bootstrap iterations.</p>
<fig id="fig-4" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-4</object-id>
<label>Figure 4</label>
<caption><title>StrainSeeker testing workflow and error types.</title>
<p>(A) The workflow that we used to build the “gene alignment-based reference tree” which was used to determine the true clade of each of the 100 <italic>E. coli</italic>
 isolates used in testing. Programs applied are marked on top of the arrows. (B) Three types of errors that StrainSeeker can make. Path of the search process is marked with a dashed red line. Type 1 error indicates that the search process stops (marked with N) before it reaches the correct clade and a larger clade containing Clade 1 and Clade 2 is reported. Type 2 error means that an incorrect clade (Clade 2) is reported instead of the true one (Clade 1). Type 3 error signifies an ambiguous result in which more than a single clade is reported (Clade 1 and additionally, Clade 3), each with a relative fraction in sample above 5%.</p>
</caption>
<graphic xlink:href="peerj-05-3353-g004"></graphic>
</fig>
<p>The other two guide trees were built using a distance matrix made with an alignment-free, <italic>k</italic>
-mer based method Mash (<xref rid="ref-14" ref-type="bibr">Ondov et al., 2016</xref>
) (parameters <italic>s</italic>
 = 10,000, <italic>k</italic>
 = 21). The first, “Mash-based guide tree,” contained the same 74 <italic>E. coli</italic>
 strains as the “gene alignment-based guide tree.” The other, “Large Mash-based guide tree,” contained all 4,324 available bacterial genomes from the NCBI RefSeq database (release 69). Trees were constructed with MEGA6 (<xref rid="ref-20" ref-type="bibr">Tamura et al., 2013</xref>
) using the unweighted pair group method with arithmetic mean (UPGMA).</p>
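<p>For readers unfamiliar with Mash, the distance it reports is derived from the Jaccard index of two <italic>k</italic>-mer sets. The sketch below computes that distance from an exact Jaccard index; Mash itself estimates the index from MinHash sketches (here of size <italic>s</italic> = 10,000) rather than from full <italic>k</italic>-mer sets, so this is a simplified illustration and the function names are hypothetical.</p>

```python
import math


def jaccard(kmers_a, kmers_b):
    """Exact Jaccard index of two k-mer sets (Mash estimates this via MinHash)."""
    union = kmers_a.union(kmers_b)
    if not union:
        return 0.0
    return len(kmers_a.intersection(kmers_b)) / len(union)


def mash_distance(j, k=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)) for Jaccard index j."""
    if j == 0:
        return 1.0  # no shared k-mers: Mash reports the maximum distance
    return -math.log(2.0 * j / (1.0 + j)) / k
```

<p>A pairwise matrix of such distances is the kind of input from which a UPGMA tree can then be constructed.</p>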
</sec>
<sec><title>StrainSeeker databases</title>
<p>We created all the databases using the GenomeTester4 software (<xref rid="ref-6" ref-type="bibr">Kaplinski, Lepamets & Remm, 2015</xref>
). Nineteen databases with different <italic>k</italic>
-mer lengths were built altogether: 14 contained 74 <italic>E. coli</italic>
 genomes obtained from the NCBI RefSeq database and were based on either the “gene alignment-based guide tree” or “Mash-based guide tree” (<italic>k</italic>
 ∈ <italic>K</italic>
; <italic>K</italic>
 = {14, 15, 16, 20, 24, 28, 32}); five databases contained 4,324 bacterial genomes from the NCBI RefSeq database and were based on the “Large Mash-based guide tree” (<italic>k</italic>
 ∈ <italic>K</italic>
; <italic>K</italic>
 = {16, 20, 24, 28, 32}). Databases based on the “Large Mash-based guide tree” (<italic>k < 16</italic>
) contained <italic>E. coli</italic>
 strains without any specific <italic>k</italic>
-mers. These were omitted in the performance testing.</p>
</sec>
<sec><title>StrainSeeker identification algorithm</title>
<p>First, the algorithm converts sequencing reads to a <italic>k</italic>
-mer list (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>
). Reads are converted to <italic>k</italic>
-mers using a sliding window with a single nucleotide step. <italic>K</italic>
-mers containing ambiguous nucleotides are removed. In the database, each guide tree node has a number of <italic>k</italic>
-mers specific to it, referred to as “node-specific <italic>k</italic>
-mers.”</p>
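<p>The read-to-<italic>k</italic>-mer conversion described above can be sketched as follows. This is a minimal illustration, not the GenomeTester4 code that StrainSeeker relies on; the function name read_kmers is hypothetical, and strand handling (for example, canonical <italic>k</italic>-mers) is deliberately left out because the text does not specify it.</p>

```python
def read_kmers(read, k):
    """Yield every k-mer of a read using a sliding window with a step of one.

    k-mers that contain anything other than A, C, G or T (for example N)
    are skipped, mirroring the removal of ambiguous nucleotides.
    """
    read = read.upper()
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if all(base in "ACGT" for base in kmer):
            yield kmer
```

<p>For example, with <italic>k</italic> = 3 the read ACGTNAC yields only ACG and CGT, because the remaining three windows all overlap the ambiguous base.</p>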
<p>The identification process starts at the root and recursively moves down toward the leaves. For each step, the fraction of observed <italic>k</italic>
-mers <italic>O</italic>
 is calculated for the current node by dividing the number of node-specific <italic>k</italic>
-mers <italic>N</italic>
 found in the sample by the total number of node-specific <italic>k</italic>
-mers <italic>N</italic>
<sub>max</sub>
: <italic>O = N/N</italic>
<sub>max</sub>
. If <italic>O</italic>
 is below a minimum level, calculated based on the total number of <italic>k</italic>
-mers in the node, the search will not continue further. Otherwise, an observed/expected ratio (<italic>O</italic>
/<italic>E</italic>
) of node-specific <italic>k</italic>
-mers is calculated. The expected number of <italic>k</italic>
-mers for the given node is the number of <italic>k</italic>-mers that should be observed if bacteria from either of the two sub-clades of the node are present in the sample. For a node A with children B and C it is as follows:
<disp-formula id="eqn-1"><alternatives><graphic xlink:href="peerj-05-3353-e001.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M1">\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym} 
\usepackage{amsfonts} 
\usepackage{amssymb} 
\usepackage{amsbsy}
\usepackage{upgreek}
\usepackage{mathrsfs}
\setlength{\oddsidemargin}{-69pt}
\begin{document}
$$O/E = {O_{\rm{A}}} \div \left({{O_{\rm{B}}} + {O_{\rm{C}}}-{O_{\rm{B}}} \cdot {O_{\rm{C}}}} \right).$$\end{document}</tex-math>
<mml:math id="mml-eqn-1"><mml:mrow><mml:mi>O</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub><mml:mi>O</mml:mi>
<mml:mtext>A</mml:mtext>
</mml:msub>
<mml:mo>÷</mml:mo>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mrow><mml:msub><mml:mi>O</mml:mi>
<mml:mtext>B</mml:mtext>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub><mml:mi>O</mml:mi>
<mml:mtext>C</mml:mtext>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub><mml:mi>O</mml:mi>
<mml:mtext>B</mml:mtext>
</mml:msub>
<mml:mo>⋅</mml:mo>
<mml:msub><mml:mi>O</mml:mi>
<mml:mtext>C</mml:mtext>
</mml:msub>
</mml:mrow>
<mml:mo>).</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</alternatives>
</disp-formula>
<italic>O</italic>
/<italic>E</italic>
 < 1 indicates mainly sequencing errors; <italic>O</italic>
/<italic>E</italic>
 > 1 indicates that there is a strain that is related to the given node but not to either of its sub-nodes; <italic>O</italic>
/<italic>E</italic>
 ≈ 1 indicates that at least one of the sub-nodes is present in the sample. We used an asymptotic test with a significance level of 5 × 10<sup>−5</sup>
 to test the hypothesis that <italic>O</italic>
/<italic>E</italic>
 = 1 (<xref ref-type="supplementary-material" rid="supp-5">Article S1</xref>
). If we cannot reject the hypothesis, the search will continue in sub-nodes with <italic>O</italic>
 and <italic>O</italic>
/<italic>E</italic>
 calculated and checked at each step until either the strain level is reached or the hypothesis is rejected. If we reject the hypothesis, the current branch is either discarded because its apparent presence was due to noise (<italic>O</italic>
/<italic>E</italic>
 < 1) or all strains under the current node will be reported in the output (<italic>O</italic>
/<italic>E</italic>
 > 1).</p>
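<p>The decision logic of this paragraph can be sketched as a short recursive search. Two simplifications are assumptions made purely for illustration: the node-dependent minimum level of <italic>O</italic> is replaced by a fixed, hypothetical min_fraction, and the asymptotic test of Article S1 is replaced by a hypothetical tolerance band around 1. The guide tree is again assumed to be binary and written as nested tuples.</p>

```python
import math


def oe_ratio(o_a, o_b, o_c):
    """O/E for a node A with children B and C: O_A / (O_B + O_C - O_B * O_C)."""
    expected = o_b + o_c - o_b * o_c
    return o_a / expected if expected > 0 else math.inf


def strains_under(node):
    """List all strain names (leaves) below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in strains_under(child)]


def search(node, observed, min_fraction=0.05, tolerance=0.2):
    """Return a list of reported results, each a tuple of strain names.

    observed     -- dict: node -> O = N / N_max for that node
    min_fraction -- stand-in for the per-node cutoff (assumption)
    tolerance    -- stand-in for the asymptotic test of O/E = 1 (assumption)
    """
    o_node = observed[node]
    if min_fraction > o_node:
        return []                      # too few node-specific k-mers: stop
    if isinstance(node, str):
        return [(node,)]               # leaf reached: report the strain
    left, right = node
    ratio = oe_ratio(o_node, observed[left], observed[right])
    if tolerance >= abs(ratio - 1.0):  # O/E close to 1: continue in sub-nodes
        return (search(left, observed, min_fraction, tolerance)
                + search(right, observed, min_fraction, tolerance))
    if ratio > 1.0:                    # unknown strain within this clade
        return [tuple(strains_under(node))]
    return []                          # O/E below 1: treated as noise
```

<p>With a four-strain tree in which one clade is well covered but neither of its two strains is, this sketch reports the whole clade, which corresponds to the situation of node N4 in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>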
<p>To calculate the relative genome fractions in case of multiple strains present in the sample, we assumed that the number of times a <italic>k</italic>
-mer is seen follows a Poisson distribution. To reduce the influence of possible errors (either due to sequencing errors or a <italic>k</italic>
-mer not being unique), we sorted the <italic>k</italic>
-mer list by frequency and removed both the top 10% and the bottom 10% of <italic>k</italic>
-mers from the list. We calculated the mean coverage from the remaining <italic>k</italic>
-mers. Based on the mean of the truncated observations, the mean of the non-truncated Poisson distribution was estimated by maximum likelihood.</p>
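<p>The trimming step can be sketched as below. The sketch stops at the trimmed mean and does not reproduce the maximum-likelihood correction for the truncated Poisson distribution; normalizing the per-strain coverages so that they sum to 1 is an assumption about how relative fractions could be derived, and both function names are hypothetical.</p>

```python
def trimmed_mean_coverage(counts, trim=0.10):
    """Mean k-mer frequency after dropping the lowest and highest 10%."""
    ordered = sorted(counts)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept) if kept else 0.0


def relative_fractions(coverage_by_strain):
    """Normalize per-strain coverages so that they sum to 1 (assumption)."""
    total = sum(coverage_by_strain.values())
    if total == 0:
        return {}
    return {strain: cov / total for strain, cov in coverage_by_strain.items()}
```

<p>Trimming both tails keeps a single erroneous low-count <italic>k</italic>-mer or a highly repeated one from dominating the coverage estimate.</p>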
</sec>
<sec><title><italic>Escherichia coli</italic>
 and <italic>Klebsiella pneumoniae</italic>
 isolation, DNA sequencing and initial identification</title>
<p><italic>Escherichia coli</italic>
 and <italic>Klebsiella pneumoniae</italic>
 strains used for testing were isolated from samples taken from hospitals during the project BARN. Full description and assembled sequences of the strains will be published elsewhere. Raw reads of all the strains used for testing are available at the European Nucleotide Archive (study accession <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB20419">PRJEB20419</uri>
). The strains were isolated from different clinical materials: blood, pus, urine and the respiratory tract. Initial bacterial identification was performed using MALDI-TOF MS (MALDI Biotyper, Bruker Daltonics GmbH, Germany). DNA templates for sequencing were generated by growing isolate cultures overnight on blood agar (Oxoid Limited, Hampshire, UK). Total DNA from the bacterial strains was extracted using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) and quantified using the Qubit® 2.0 Fluorometer (Invitrogen, Grand Island, NY, USA). A total of 1 ng of sample DNA was processed for the sequencing libraries using the Illumina Nextera XT sample preparation kit (Illumina, San Diego, CA, USA) according to the manufacturer’s instructions. The DNA normalization step was skipped; instead, the final dsDNA libraries were quantified with the Qubit® 2.0 Fluorometer and pooled in equimolar concentrations. The library pool was validated with 2200 TapeStation (Agilent Technologies, Santa Clara, CA, USA) measurements, and qPCR was performed with the Kapa Library Quantification Kit (Kapa Biosystems, Woburn, MA, USA) to optimize cluster generation. A total of 667 <italic>E. coli</italic>
 and 539 <italic>K. pneumoniae</italic>
 genomic libraries were sequenced with 2 × 101 base pair (bp) paired-end reads on the HiSeq2500 rapid run flowcell (Illumina, San Diego, CA, USA). Demultiplexing was performed with CASAVA 1.8.2 (Illumina, San Diego, CA, USA) allowing one mismatch in the index reads.</p>
</sec>
<sec><title><italic>Escherichia coli</italic>
 genome assembly and multi-locus sequence typing</title>
<p>Genomes were assembled with the <italic>de novo</italic>
 assembly program Velvet (<xref rid="ref-23" ref-type="bibr">Zerbino & Birney, 2008</xref>
). Prior to assembling, the reads were trimmed and filtered for quality (<italic>fastq_quality_trimmer -Q33 -t 30 -l 40, fastq_quality_filter -Q33 -q 25 -p 90</italic>
) (<uri xlink:href="http://hannolab.cshl.edu/fastx_toolkit/">http://hannolab.cshl.edu/fastx_toolkit/</uri>
). A cyclic assembly process was applied to each genome, in which different Velvet parameter values (<italic>-exp_cov, -cov_cutoff (3, 5, 10, 15), -min_pair_count (1–5), -ins_length (100–350)</italic>
) were tested until all MLST genes were found or the best set of MLST genes was retrieved. For accurate MLST type identification, we used the assembled <italic>E. coli</italic>
 genomes and an MLST tool published by <xref rid="ref-10" ref-type="bibr">Larsen et al. (2012)</xref>
 that calculates the MLST profile based on a BLAST (<xref rid="ref-2" ref-type="bibr">Altschul et al., 1997</xref>
) alignment of the input sequence file and the specified allele set. Public <italic>E. coli</italic>
 database “#1” version 2014_01 for molecular typing was downloaded from PubMLST (<uri xlink:href="http://www.pubmlst.org/">http://www.pubmlst.org/</uri>
).</p>
</sec>
<sec><title>Data sets used to assess the performance of StrainSeeker</title>
<p>We randomly selected 100 strains from 667 <italic>E. coli</italic>
 samples (<xref ref-type="supplementary-material" rid="supp-3">Table S1</xref>
). Assembled genomes of these strains were also included in the <italic>E. coli</italic>
 “gene alignment-based reference tree” (shown in <xref ref-type="supplementary-material" rid="supp-2">Fig. S2</xref>
) that we used to assess the results of StrainSeeker (<xref ref-type="fig" rid="fig-4">Fig. 4</xref>
).</p>
<p>In order to test the identification speed of the programs, we downloaded raw reads of three bacterial species from the European Nucleotide Archive (study accession <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB8647">PRJEB8647</uri>
, run accession numbers <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/ERR769199">ERR769199</uri>
 [<italic>Enterococcus faecium</italic>
], <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/ERR769279">ERR769279</uri>
 [<italic>Enterococcus faecalis</italic>
] and <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/ERR769315">ERR769315</uri>
 [<italic>Salmonella enterica</italic>
]) and also used raw reads of a randomly selected <italic>K. pneumoniae</italic>
 and <italic>E. coli</italic>
 isolate. Raw reads of 100 <italic>E. coli</italic>
 test strains and the <italic>Klebsiella pneumoniae</italic>
 isolate are available at the European Nucleotide Archive (study accession <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB20419">PRJEB20419</uri>
).</p>
</sec>
</sec>
<sec sec-type="results"><title>Results</title>
<sec><title>Using StrainSeeker to predict the clades of <italic>E. coli</italic>
 isolates</title>
<p>To determine the correct clade for each of the 100 <italic>E. coli</italic>
 strains used in the test, we built the “gene alignment-based reference tree” (shown in <xref ref-type="supplementary-material" rid="supp-2">Fig. S2</xref>
) that included sequences of 100 test strains (<xref ref-type="supplementary-material" rid="supp-3">Table S1</xref>
) and sequences of 74 <italic>E. coli</italic>
 strains obtained from the NCBI RefSeq database. The clade threshold was set to 0.001 nucleotide substitutions per site. Tests were run using databases based on three guide trees (see Methods): the “gene alignment-based guide tree” and the “Mash-based guide tree” contained 74 <italic>E. coli</italic>
 RefSeq strains (<xref ref-type="fig" rid="fig-5">Figs. 5A</xref>
 and <xref ref-type="fig" rid="fig-5">5B</xref>
) and the “Large Mash-based guide tree” contained 4,324 bacterial and archaeal strains obtained from the NCBI RefSeq database (<xref ref-type="fig" rid="fig-5">Fig. 5C</xref>
). We used the “gene alignment-based guide tree” as the positive control as we expected it to be the most accurate approximation of phylogenetic relationships. Versions with <italic>k</italic>
-mer lengths 14–32 (16–32 in the case of the full NCBI bacteria database) were made from the databases. We counted the samples in which StrainSeeker made any of the three types of error (<xref ref-type="fig" rid="fig-4">Fig. 4B</xref>
). None of the isolates were assigned to incorrect clades. StrainSeeker’s ability to correctly predict the isolates’ clade increased with lower <italic>k</italic>
 values, mainly because its search process (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>
) was more likely to stop prematurely in the case of higher <italic>k</italic>
. However, in the case of <italic>k</italic>
 = 16 (<italic>k</italic>
 = 15 in the case of the small database), the number of ambiguous results increased. Therefore, shorter <italic>k</italic>
-mers (16–20) are useful to avoid premature termination of the search process and longer <italic>k</italic>
-mers (28–32) will prevent some of the ambiguous identifications. If a premature search stop is not a problem, longer <italic>k</italic>
-mers can be used. StrainSeeker made more errors using the “gene alignment-based guide tree” compared to the “Mash-based guide tree,” in the case of <italic>k</italic>
 = 16 and higher values. Using the database based on the “Large Mash-based guide tree,” <italic>k</italic>
 = 16 and clade distance 0.001, StrainSeeker made fewer errors than with the other databases and its accuracy was 92% (<xref ref-type="fig" rid="fig-5">Fig. 5C</xref>
).</p>
<fig id="fig-5" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-5</object-id>
<label>Figure 5</label>
<caption><title><italic>K</italic>
-mer length effects on StrainSeeker results.</title>
<p>A total of 100 <italic>E. coli</italic>
 samples were identified with StrainSeeker using databases based on three guide trees with <italic>k</italic>
-mer lengths ranging from 14 to 32 (16 to 32 in the case of (C), see Methods). (A) Identifications made with the database containing 74 <italic>E. coli</italic>
 strains, based on the “gene alignment-based guide tree.” (B) Identifications made with the database containing 74 <italic>E. coli</italic>
 strains, based on the “Mash-based guide tree.” (C) Identifications made with the 4,324-strain database, based on the “Large Mash-based guide tree.”</p>
</caption>
<graphic xlink:href="peerj-05-3353-g005"></graphic>
</fig>
<p>To see how well StrainSeeker can assign the clades of closely related <italic>E. coli</italic>
 strains, we tested the accuracy of StrainSeeker with smaller clades down to distance 0.00001, using the database based on the “Large Mash-based guide tree” and <italic>k</italic>
 = 16. This resulted in several clades that contained test strains but no reference <italic>E. coli</italic>
 strains, which makes validation of those test strains impossible. For this reason, the number of strains that can be validated decreases as the clade distance decreases. With clade thresholds of 0.0003, 0.0001, 0.00003 and 0.00001 nucleotide substitutions per site, StrainSeeker identified the clades of 80 <italic>E. coli</italic>
 isolates with 90%, 69 isolates with 91%, 58 isolates with 90% and 46 isolates with 76% accuracy, respectively.</p>
</sec>
<sec><title>Minimum amount of reads required for clade prediction</title>
<p>To determine the required coverage for accurate isolate clade prediction using StrainSeeker, we used the same 100 <italic>E. coli</italic>
 test samples as above, but lowered the number of reads analyzed. We used the large database, based on the “Large Mash-based guide tree” and <italic>k</italic>
 = 16. The number of results without any errors increases with the number of reads, and the predictions are accurate if at least 25,000 reads from a given strain are present (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>
). This indicates that in order to predict the clade of an isolate, sequencing with low coverage is sufficient and many more samples could be sequenced simultaneously in a single sequencing run.</p>
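<p>The read-number reduction described above can be illustrated with a short Python sketch. The subsampling procedure is not specified in this section, so uniform random sampling (reservoir sampling) is assumed here; the function name subsample_fastq and the toy records are hypothetical and are not part of StrainSeeker.</p>
<preformat>
```python
import random
from itertools import islice


def subsample_fastq(lines, n_reads, seed=0):
    """Pick n_reads FASTQ records uniformly at random (reservoir sampling).

    lines is any iterable of FASTQ lines; a record is assumed to span
    exactly four lines (header, sequence, '+', qualities).
    """
    rng = random.Random(seed)
    it = iter(lines)
    reservoir = []
    seen = 0
    while True:
        record = list(islice(it, 4))
        if len(record) != 4:  # end of input (or truncated last record)
            break
        seen += 1
        if n_reads >= seen:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randrange(seen)  # uniform index in [0, seen)
            if n_reads > j:
                reservoir[j] = record  # keep with probability n_reads / seen
    return reservoir


# Toy input: ten hypothetical 8 bp reads.
toy_fastq = []
for i in range(10):
    toy_fastq += [f"@read{i}", "ACGTACGT", "+", "IIIIIIII"]

sample = subsample_fastq(toy_fastq, n_reads=3)
print(len(sample))
```
</preformat>
<p>Reservoir sampling reads the input once and holds only the selected records in memory, which suits large read files; if fewer records are available than requested, all of them are returned.</p>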
<fig id="fig-6" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/fig-6</object-id>
<label>Figure 6</label>
<caption><title>Minimum amount of reads required for isolate identification.</title>
<p>The line shows the percentage of results without any types of error (<xref ref-type="fig" rid="fig-4">Fig. 4B</xref>
) obtained when a given number of 101 bp Illumina reads was analyzed. We used the large database, containing 4,324 strains, based on the “Large Mash-based guide tree” and <italic>k</italic>
 = 16.</p>
</caption>
<graphic xlink:href="peerj-05-3353-g006"></graphic>
</fig>
</sec>
<sec><title>StrainSeeker’s performance compared to other identification tools</title>
<p>We compared three other bacterial identification tools (Sigma, Reads2Type and Kraken) with StrainSeeker (<xref ref-type="table" rid="table-1">Table 1</xref>
). Kraken (<xref rid="ref-22" ref-type="bibr">Wood & Salzberg, 2014</xref>
) classifies each sequencing read using exact <italic>k</italic>
-mer matching, Sigma (<xref rid="ref-1" ref-type="bibr">Ahn, Chai & Pan, 2015</xref>
) aligns reads to reference genomes and Reads2Type (<xref rid="ref-18" ref-type="bibr">Saputra et al., 2015</xref>
) uses species-specific markers. Kraken and Sigma are designed to run on a UNIX server, whereas Reads2Type is a web-based tool. All programs except Reads2Type were tested on a UNIX server with 512 GB of total RAM, using 1 CPU core.</p>
<table-wrap id="table-1" orientation="portrait" position="float"><object-id pub-id-type="doi">10.7717/peerj.3353/table-1</object-id>
<label>Table 1</label>
<caption><title>Speed comparison of StrainSeeker, Kraken, Sigma and Reads2Type.</title>
<p>We used the “minikraken” database with Kraken, “fast” mode with Reads2Type and the 4,324-strain database based on the “Mash-based guide tree” and <italic>k</italic>
 = 16 in the case of StrainSeeker.</p>
</caption>
<alternatives><graphic xlink:href="peerj-05-3353-g007"></graphic>
<table frame="hsides" rules="groups" content-type="text"><colgroup span="1"><col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
</colgroup>
<thead><tr><th rowspan="2" colspan="1">Species</th>
<th rowspan="2" colspan="1">Read count (M)</th>
<th rowspan="2" colspan="1">Read length (bp)</th>
<th rowspan="2" colspan="1">Coverage</th>
<th colspan="4" rowspan="1">Identification time (min)</th>
</tr>
<tr><th rowspan="1" colspan="1">StrainSeeker</th>
<th rowspan="1" colspan="1">Sigma</th>
<th rowspan="1" colspan="1">Reads2Type</th>
<th rowspan="1" colspan="1">Kraken</th>
</tr>
</thead>
<tbody><tr><td rowspan="1" colspan="1"><italic>Escherichia coli</italic>
</td>
<td rowspan="1" colspan="1">1.39</td>
<td rowspan="1" colspan="1">101</td>
<td rowspan="1" colspan="1">28×</td>
<td rowspan="1" colspan="1">1.1</td>
<td rowspan="1" colspan="1">891.2</td>
<td rowspan="1" colspan="1">2.8</td>
<td rowspan="1" colspan="1">1.1</td>
</tr>
<tr><td rowspan="1" colspan="1"><italic>Klebsiella pneumoniae</italic>
</td>
<td rowspan="1" colspan="1">2.37</td>
<td rowspan="1" colspan="1">101</td>
<td rowspan="1" colspan="1">46×</td>
<td rowspan="1" colspan="1">1.1</td>
<td rowspan="1" colspan="1">1303.6</td>
<td rowspan="1" colspan="1">3.3</td>
<td rowspan="1" colspan="1">1.7</td>
</tr>
<tr><td rowspan="1" colspan="1"><italic>Enterococcus faecalis</italic>
</td>
<td rowspan="1" colspan="1">4.03</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">138×</td>
<td rowspan="1" colspan="1">1.1</td>
<td rowspan="1" colspan="1">2065.0</td>
<td rowspan="1" colspan="1">0.6</td>
<td rowspan="1" colspan="1">2.8</td>
</tr>
<tr><td rowspan="1" colspan="1"><italic>Enterococcus faecium</italic>
</td>
<td rowspan="1" colspan="1">4.44</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">147×</td>
<td rowspan="1" colspan="1">1.1</td>
<td rowspan="1" colspan="1">2211.4</td>
<td rowspan="1" colspan="1">NA</td>
<td rowspan="1" colspan="1">3.0</td>
</tr>
<tr><td rowspan="1" colspan="1"><italic>Salmonella enterica</italic>
</td>
<td rowspan="1" colspan="1">4.94</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">101×</td>
<td rowspan="1" colspan="1">1.2</td>
<td rowspan="1" colspan="1">2431.6</td>
<td rowspan="1" colspan="1">0.2</td>
<td rowspan="1" colspan="1">3.1</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="table-1fn"><p><bold>Notes:</bold>
</p>
</fn>
<fn id="table-1fn1" fn-type="other"><p>Final size for the large, 4,324 strain database based on the “Mash-based guide tree” ranged from 4 to 18 GB (<italic>k</italic>
 = 16 and <italic>k</italic>
 = 32, respectively). If disk space for database building process is a constraint, databases can be downloaded from <uri xlink:href="http://bioinfo.ut.ee/strainseeker">http://bioinfo.ut.ee/strainseeker</uri>
 or figshare (<uri xlink:href="https://figshare.com/s/453ab0fb39ba6a06f91d">https://figshare.com/s/453ab0fb39ba6a06f91d</uri>
). StrainSeeker does not load the whole database in memory and we successfully tested it using a laptop with 8 GB of RAM.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Accuracy was calculated based on the test set of 100 <italic>E. coli</italic>
 strains. All tools except Reads2Type were able to give multiple strains with different abundances as the output (<xref ref-type="supplementary-material" rid="supp-1">Fig. S1</xref>
); therefore, the three error types used to describe StrainSeeker’s results cannot be used for this comparison. Instead, we selected the strain with the highest estimated abundance in the output file of each program and assessed whether it belonged to the correct clade. Measured this way, the results for the set of 100 <italic>E. coli</italic>
 strains were as follows: StrainSeeker’s accuracy was 99% and Kraken’s accuracy was 69%. Because of its excessive computing time (about 1,000× slower than the other programs), we did not test the accuracy of Sigma. Reads2Type can identify samples only at the species level, not at the strain level. Therefore, Reads2Type and Sigma were not used in this comparison.</p>
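<p>The scoring rule used for this comparison (take the strain with the highest estimated abundance and check its clade) can be sketched in Python as follows. All strain, clade and isolate names are hypothetical placeholders, and the dictionaries stand in for parsed tool output; parsing the actual output files of each program is not shown.</p>
<preformat>
```python
def top_hit(abundances):
    """Return the strain with the highest estimated abundance."""
    return max(abundances, key=abundances.get)


def clade_accuracy(predictions, strain_to_clade, true_clades):
    """Fraction of samples whose top-abundance strain lies in the correct clade."""
    correct = 0
    for sample_id, abundances in predictions.items():
        predicted_clade = strain_to_clade.get(top_hit(abundances))
        if predicted_clade == true_clades[sample_id]:
            correct += 1
    return correct / len(predictions)


# Hypothetical toy data: two isolates, three reference strains, two clades.
predictions = {
    "isolate_1": {"strain_A": 0.7, "strain_B": 0.3},
    "isolate_2": {"strain_C": 0.6, "strain_A": 0.4},
}
strain_to_clade = {"strain_A": "clade_1", "strain_B": "clade_1", "strain_C": "clade_2"}
true_clades = {"isolate_1": "clade_1", "isolate_2": "clade_1"}

accuracy = clade_accuracy(predictions, strain_to_clade, true_clades)
print(accuracy)
```
</preformat>
<p>In this toy case only the first isolate is scored as correct, because the top hit of the second isolate falls in a different clade even though a correct strain is listed with lower abundance.</p>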
<p>Comparison of time spent on identification (<xref ref-type="table" rid="table-1">Table 1</xref>
) shows that Sigma took roughly 15 to 40 h per sample, whereas the other tools took only a few minutes. This is mainly because read alignment is computationally more expensive than exact <italic>k</italic>
-mer matching (<xref rid="ref-22" ref-type="bibr">Wood & Salzberg, 2014</xref>
). Reads2Type identification speed varies and is not related to the sample size as it does not analyze all reads, but stops as soon as a read matches a unique probe. StrainSeeker scales well with large samples, taking almost the same amount of time for each sample. Identification results of all programs were correct on the species level, except that Reads2Type was unable to identify <italic>Enterococcus faecium</italic>
.</p>
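<p>The relative speeds can be recomputed directly from the identification times in Table 1. The short Python sketch below only restates the table values; the variable names are illustrative.</p>
<preformat>
```python
# Identification times in minutes, copied from Table 1:
# species -> (StrainSeeker, Sigma, Kraken)
table1 = {
    "Escherichia coli": (1.1, 891.2, 1.1),
    "Klebsiella pneumoniae": (1.1, 1303.6, 1.7),
    "Enterococcus faecalis": (1.1, 2065.0, 2.8),
    "Enterococcus faecium": (1.1, 2211.4, 3.0),
    "Salmonella enterica": (1.2, 2431.6, 3.1),
}

sigma_fold = {}
kraken_fold = {}
for species, (strainseeker, sigma, kraken) in table1.items():
    sigma_fold[species] = sigma / strainseeker    # Sigma time relative to StrainSeeker
    kraken_fold[species] = kraken / strainseeker  # Kraken time relative to StrainSeeker
    print(species, round(sigma_fold[species]), round(kraken_fold[species], 1))
```
</preformat>
<p>For these five samples, Sigma is between roughly 800 and 2,000 times slower than StrainSeeker, while Kraken’s running time grows with the read count and reaches about 2.5 to 3 times that of StrainSeeker for the largest samples.</p>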
</sec>
</sec>
<sec sec-type="discussion"><title>Discussion</title>
<p>It is important to identify novel strains because their phenotypes can differ greatly from those of their relatives. To solve this problem, we use a guide tree that allows us to narrow down the clade to which the isolate belongs. We decided not to use NCBI taxonomy because it did not contain strain-level relationships, making it unsuitable for the identification of clades within a species. Also, in the NCBI tree, some taxa are not monophyletic, e.g., the <italic>Shigella/Escherichia</italic>
 branch.</p>
<p>Contrary to other <italic>k</italic>
-mer-based programs identifying bacteria (<xref rid="ref-11" ref-type="bibr">Lindgreen, Adair & Gardner, 2016</xref>
), StrainSeeker checks which of the specific <italic>k</italic>
-mers are found in the sample as a whole, instead of classifying each sequencing read. This makes StrainSeeker less vulnerable to errors when the sample contains many <italic>k</italic>
-mers (for technical or biological reasons) that, according to the database, are specific to a species present in the database but in fact originate from another species not represented in it. Also, identifying each read separately could give a distorted result if the exact isolate is not present in the database. In such a case, reads will be assigned to multiple bacterial genomes and the user cannot know whether the sample contained an unknown strain or multiple related strains.</p>
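<p>The difference between per-read classification and a sample-level check of node-specific k-mers can be illustrated with a toy Python sketch. This is a simplified illustration, not StrainSeeker’s implementation: the sequences are invented, reverse complements are ignored, and the statistical test that compares observed and expected fractions is omitted.</p>
<preformat>
```python
def kmers(seq, k):
    """Set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def observed_fraction(node_kmers, reads, k):
    """Fraction of node-specific k-mers found anywhere in the pooled reads."""
    sample_kmers = set()
    for read in reads:
        sample_kmers.update(kmers(read, k))
    if not node_kmers:
        return 0.0
    return len(node_kmers.intersection(sample_kmers)) / len(node_kmers)


K = 4
genome_a = "ACGTACGGTCAAGT"  # hypothetical reference for node A
genome_b = "TTGACCGTAAGCTA"  # hypothetical reference for node B

# Node-specific k-mers: present in one reference but not in the other.
a_specific = kmers(genome_a, K) - kmers(genome_b, K)
b_specific = kmers(genome_b, K) - kmers(genome_a, K)

reads = ["ACGTACGG", "CGGTCAAGT"]  # toy reads drawn from genome_a

frac_a = observed_fraction(a_specific, reads, K)
frac_b = observed_fraction(b_specific, reads, K)
print(frac_a, frac_b)
```
</preformat>
<p>Because the k-mers of all reads are pooled before comparison, the result is one observed fraction per node rather than one label per read; in the toy data all node A k-mers and none of the node B k-mers are observed.</p>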
<p>In the present work, we tested the performance of StrainSeeker using two different guide trees. One was based on an alignment of <italic>E. coli</italic>
 shared genes and included 74 <italic>E. coli</italic>
 strains, the other used a <italic>k</italic>
-mer-based distance method (<xref rid="ref-14" ref-type="bibr">Ondov et al., 2016</xref>
) and consisted of 4,324 bacterial and archaeal strains. StrainSeeker proved to be highly accurate in clade prediction especially with <italic>k</italic>
 values ranging from 16 to 20. Lower <italic>k</italic>
 values resulted in incorrect branches being identified along with the correct clade. This could be caused by sequencing errors, as shorter <italic>k</italic>
-mers are more likely to be assigned to wrong nodes due to errors than longer <italic>k</italic>
-mers. Values for <italic>k</italic>
 higher than 20 are not recommended for clade prediction as StrainSeeker’s search process is more likely to stop prematurely. One reason for this is the total number of node-specific <italic>k</italic>
-mers, which increases with higher <italic>k</italic>
 values as longer <italic>k</italic>
-mers are more specific, but also more likely to contain sequencing errors.</p>
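<p>The statement that longer k-mers are more likely to contain sequencing errors follows from simple arithmetic, sketched below in Python. The model assumes independent errors with a uniform per-base error rate, and the 1% rate is a hypothetical value chosen for illustration, not a figure measured in this study.</p>
<preformat>
```python
def error_free_fraction(k, per_base_error):
    """Probability that a k-mer contains no sequencing error,
    assuming independent errors with a uniform per-base rate."""
    return (1.0 - per_base_error) ** k


PER_BASE_ERROR = 0.01  # hypothetical 1% error rate

fractions = {}
for k in (16, 20, 32):
    fractions[k] = error_free_fraction(k, PER_BASE_ERROR)
    print(k, round(fractions[k], 3))
```
</preformat>
<p>Under these assumptions about 85% of 16-mers but only about 72% of 32-mers are error-free, so a smaller share of the expected node-specific k-mers is observed at higher k.</p>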
<p>In order to correctly predict phenotypic traits of an isolate, such as mutations conferring resistance to antibiotics, at least 10 times (10×) coverage is necessary (<xref rid="ref-5" ref-type="bibr">Inouye et al., 2014</xref>
; <xref rid="ref-3" ref-type="bibr">Bradley et al., 2015</xref>
). In our study, we demonstrated that the minimum sequencing coverage required for accurate clade prediction is less than 1× in the case of <italic>E. coli</italic>
. Based on this knowledge, multiple samples could be sequenced in a single run, saving resources and increasing throughput. This could be useful in all cases in which knowing only the clade of the strain would be sufficient, such as large-scale screening for known pathogenic bacteria.</p>
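<p>The coverage figure can be checked with a short calculation in Python. The genome size used here is inferred from the <italic>E. coli</italic> row of Table 1 (1.39 million reads of 101 bp at 28× coverage) and is therefore an assumption of this sketch rather than a value stated in the text.</p>
<preformat>
```python
def coverage(n_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Genome size implied by Table 1 (E. coli row): about 5.0 Mb.
genome_size_bp = 1.39e6 * 101 / 28

# Coverage provided by the 25,000 reads reported as sufficient.
min_coverage = coverage(25_000, 101, genome_size_bp)  # about 0.5x

# Number of 25,000-read samples that fit into the reads of one 28x sample.
samples_per_28x = int(1.39e6 // 25_000)

print(round(genome_size_bp / 1e6, 2), round(min_coverage, 2), samples_per_28x)
```
</preformat>
<p>With these numbers, 25,000 reads correspond to about 0.5× coverage, and the reads spent on a single 28× sample would cover about 55 samples at that depth.</p>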
<p>StrainSeeker’s statistical framework imposes some limitations. First, it is not able to differentiate between strains that are distinguished by only a few single nucleotide variations and may not be useful in detecting clinically relevant mutations and alleles. Detecting such variants requires high coverage and is a task more suited to tools like Mykrobe and SRST2. StrainSeeker is not meant to compete with such programs, but mainly to complement them. Second, only high-quality assembled genomes can be used as an input for StrainSeeker database building.</p>
</sec>
<sec sec-type="conclusions"><title>Conclusion</title>
<p>There is a strong need for the fast detection of bacterial strains. StrainSeeker can detect strain sequences missing from public databases and identify the clade to which the isolate belongs. In the current study, we showed that StrainSeeker accurately and rapidly identifies the clades of 100 <italic>E. coli</italic>
 isolates. By using bacterial genome sequences from large public databases such as the NCBI RefSeq database, users do not have to build separate databases for each species of interest. Also, StrainSeeker does not require high coverage for accurate clade prediction. For users who are not able to use the UNIX environment, there is an online version of StrainSeeker available at <uri xlink:href="http://bioinfo.ut.ee/strainseeker/">http://bioinfo.ut.ee/strainseeker/</uri>
.</p>
</sec>
<sec sec-type="supplementary-material" id="supplemental-information"><title>Supplemental Information</title>
<supplementary-material content-type="local-data" id="supp-1"><object-id pub-id-type="doi">10.7717/peerj.3353/supp-1</object-id>
<label>Supplemental Information 1</label>
<caption><title>Output formats of StrainSeeker, Kraken, Sigma and Reads2Type.</title>
<p>All tools were used to identify an <italic>E. coli</italic>
 isolate with multi-locus sequence type 131. According to our “gene alignment-based reference tree,” this strain was very similar to <italic>E. coli</italic>
 strain JJ1886. (<bold>A</bold>
) StrainSeeker output is given either as tab-delimited text or as a pie chart showing the relative abundance of each strain. The text format shows whether the identified strain was the same strain as the database reference strain (“KNOWN”) or related to it (“RELATED”). From the results, it can be seen that a single strain related to <italic>E. coli</italic>
 JJ1886 was found. (<bold>B</bold>
) Kraken output is given as a tab-delimited text file with read numbers that were assigned to each taxonomic rank. <italic>E. coli</italic>
 JJ1886 is the strain with the highest number of assigned reads, closely followed by O7:K1, which has sequence type 62. (<bold>C</bold>
) Sigma gives an HTML-format result that can be viewed in a web browser. <italic>E. coli</italic>
 JJ1886 has the highest percentage in the sample. (<bold>D</bold>
) Reads2Type can only be used as a web tool and it gives a species-level result directly in the web browser.</p>
</caption>
<media xlink:href="peerj-05-3353-s001.png"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="supp-2"><object-id pub-id-type="doi">10.7717/peerj.3353/supp-2</object-id>
<label>Supplemental Information 2</label>
<caption><title>Gene alignment-based reference tree of <italic>E. coli</italic>
 strains.</title>
<p>The name of each of the 74 NCBI RefSeq reference strains is given as follows: [Multi-locus sequence type] [Strain name] [RefSeq identifier] [NCBI accession number]. The other 100 strains are the strains used in performance tests. The tree shown is the “gene alignment-based reference tree” (see Methods). Clades are limited by a maximum difference of 0.002 nucleotide substitutions per site between strains.</p>
</caption>
<media xlink:href="peerj-05-3353-s002.png"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="supp-3"><object-id pub-id-type="doi">10.7717/peerj.3353/supp-3</object-id>
<label>Supplemental Information 3</label>
<caption><title><italic>Escherichia coli</italic>
 isolates used in the StrainSeeker performance tests.</title>
</caption>
<media xlink:href="peerj-05-3353-s003.docx"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="supp-4"><object-id pub-id-type="doi">10.7717/peerj.3353/supp-4</object-id>
<label>Supplemental Information 4</label>
<caption><title>126 <italic>E. coli</italic>
 shared genes.</title>
<p>A list of all the 126 protein-coding genes shared between all <italic>E. coli</italic>
 test isolates and reference <italic>E. coli</italic> strains.</p>
</caption>
<media xlink:href="peerj-05-3353-s004.docx"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="supp-5"><object-id pub-id-type="doi">10.7717/peerj.3353/supp-5</object-id>
<label>Supplemental Information 5</label>
<caption><title>A thorough description of the statistical test that is part of the StrainSeeker identification algorithm.</title>
<p>The description of the statistical test that is part of the StrainSeeker identification algorithm along with its derivation and all the notations used.</p>
</caption>
<media xlink:href="peerj-05-3353-s005.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><glossary content-type="abbreviations" id="glossary-1"><title>List of abbreviations</title>
<def-list><def-item><term>bp</term>
<def><p>base pair</p>
</def>
</def-item>
<def-item><term>NCBI</term>
<def><p>National Center for Biotechnology Information</p>
</def>
</def-item>
<def-item><term>MLST</term>
<def><p>multi-locus sequence typing</p>
</def>
</def-item>
</def-list>
</glossary>
<sec sec-type="additional-information"><title>Additional Information and Declarations</title>
<fn-group content-type="competing-interests"><title>Competing Interests</title>
<fn fn-type="COI-statement" id="conflict-1"><p>Paul Naaber is an employee of Synlab Eesti, Tallinn, Estonia.</p>
</fn>
</fn-group>
<fn-group content-type="author-contributions"><title>Author Contributions</title>
<fn fn-type="con" id="contribution-1"><p><xref ref-type="contrib" rid="author-1">Märt Roosaare</xref>
 conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables and reviewed drafts of the paper.</p>
</fn>
<fn fn-type="con" id="contribution-2"><p><xref ref-type="contrib" rid="author-2">Mihkel Vaher</xref>
 conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables and reviewed drafts of the paper.</p>
</fn>
<fn fn-type="con" id="contribution-3"><p><xref ref-type="contrib" rid="author-3">Lauris Kaplinski</xref>
 contributed reagents/materials/analysis tools and reviewed drafts of the paper.</p>
</fn>
<fn fn-type="con" id="contribution-4"><p><xref ref-type="contrib" rid="author-4">Märt Möls</xref>
 contributed reagents/materials/analysis tools and prepared figures and/or tables.</p>
</fn>
<fn fn-type="con" id="contribution-5"><p><xref ref-type="contrib" rid="author-5">Reidar Andreson</xref>
 contributed reagents/materials/analysis tools.</p>
</fn>
<fn fn-type="con" id="contribution-6"><p><xref ref-type="contrib" rid="author-6">Maarja Lepamets</xref>
 contributed reagents/materials/analysis tools.</p>
</fn>
<fn fn-type="con" id="contribution-7"><p><xref ref-type="contrib" rid="author-7">Triinu Kõressaar</xref>
 contributed reagents/materials/analysis tools and reviewed drafts of the paper.</p>
</fn>
<fn fn-type="con" id="contribution-8"><p><xref ref-type="contrib" rid="author-8">Paul Naaber</xref>
 contributed reagents/materials/analysis tools.</p>
</fn>
<fn fn-type="con" id="contribution-9"><p><xref ref-type="contrib" rid="author-9">Siiri Kõljalg</xref>
 contributed reagents/materials/analysis tools.</p>
</fn>
<fn fn-type="con" id="contribution-10"><p><xref ref-type="contrib" rid="author-10">Maido Remm</xref>
 conceived and designed the experiments and reviewed drafts of the paper.</p>
</fn>
</fn-group>
<fn-group content-type="other"><title>DNA Deposition</title>
<fn id="addinfo-1"><p>The following information was supplied regarding the deposition of DNA sequences:</p>
<p>Raw reads of 100 <italic>E. coli</italic>
 test strains and the <italic>Klebsiella pneumoniae</italic>
 isolate are available at the European Nucleotide Archive (study accession <uri xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB20419">PRJEB20419</uri>
).</p>
</fn>
</fn-group>
<fn-group content-type="other"><title>Data Availability</title>
<fn id="addinfo-2"><p>The following information was supplied regarding data availability:</p>
<p>The StrainSeeker code is available at both GitHub: <uri xlink:href="https://github.com/bioinfo-ut">https://github.com/bioinfo-ut</uri>
 and the department web server: <uri xlink:href="http://bioinfo.ut.ee/strainseeker/">http://bioinfo.ut.ee/strainseeker/</uri>
.</p>
<p>Roosaare, Märt (2017): StrainSeeker databases. figshare.</p>
<p><uri xlink:href="https://doi.org/10.6084/m9.figshare.c.3750794.v1">https://doi.org/10.6084/m9.figshare.c.3750794.v1</uri>
.</p>
</fn>
</fn-group>
</sec>
<ref-list content-type="authoryear"><title>References</title>
<ref id="ref-1"><label>Ahn, Chai & Pan (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ahn</surname>
<given-names>TH</given-names>
</name>
<name><surname>Chai</surname>
<given-names>J</given-names>
</name>
<name><surname>Pan</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance</article-title>
<source>Bioinformatics</source>
<year>2015</year>
<volume>31</volume>
<issue>2</issue>
<fpage>170</fpage>
<lpage>177</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu641</pub-id>
<pub-id pub-id-type="pmid">25266224</pub-id>
</element-citation>
</ref>
<ref id="ref-2"><label>Altschul et al. (1997)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name><surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
<source>Nucleic Acids Research</source>
<year>1997</year>
<volume>25</volume>
<issue>17</issue>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pub-id pub-id-type="pmid">9254694</pub-id>
</element-citation>
</ref>
<ref id="ref-3"><label>Bradley et al. (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bradley</surname>
<given-names>P</given-names>
</name>
<name><surname>Gordon</surname>
<given-names>NC</given-names>
</name>
<name><surname>Walker</surname>
<given-names>TM</given-names>
</name>
<name><surname>Dunn</surname>
<given-names>L</given-names>
</name>
<name><surname>Heys</surname>
<given-names>S</given-names>
</name>
<name><surname>Huang</surname>
<given-names>B</given-names>
</name>
<name><surname>Earle</surname>
<given-names>S</given-names>
</name>
<name><surname>Pankhurst</surname>
<given-names>LJ</given-names>
</name>
<name><surname>Anson</surname>
<given-names>L</given-names>
</name>
<name><surname>de Cesare</surname>
<given-names>M</given-names>
</name>
<name><surname>Piazza</surname>
<given-names>P</given-names>
</name>
<name><surname>Votintseva</surname>
<given-names>AA</given-names>
</name>
<name><surname>Golubchik</surname>
<given-names>T</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Wyllie</surname>
<given-names>DH</given-names>
</name>
<name><surname>Diel</surname>
<given-names>R</given-names>
</name>
<name><surname>Niemann</surname>
<given-names>S</given-names>
</name>
<name><surname>Feuerriegel</surname>
<given-names>S</given-names>
</name>
<name><surname>Kohl</surname>
<given-names>TA</given-names>
</name>
<name><surname>Ismail</surname>
<given-names>N</given-names>
</name>
<name><surname>Omar</surname>
<given-names>SV</given-names>
</name>
<name><surname>Smith</surname>
<given-names>EG</given-names>
</name>
<name><surname>Buck</surname>
<given-names>D</given-names>
</name>
<name><surname>McVean</surname>
<given-names>G</given-names>
</name>
<name><surname>Walker</surname>
<given-names>AS</given-names>
</name>
<name><surname>Peto</surname>
<given-names>T</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<name><surname>Iqbal</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Rapid antibiotic resistance predictions from genome sequence data for <italic>S. aureus</italic>
 and <italic>M. tuberculosis</italic>
</article-title>
<source>Nature Communications</source>
<year>2015</year>
<volume>6</volume>
<fpage>10063</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms10063</pub-id>
</element-citation>
</ref>
<ref id="ref-4"><label>Hasman et al. (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hasman</surname>
<given-names>H</given-names>
</name>
<name><surname>Saputra</surname>
<given-names>D</given-names>
</name>
<name><surname>Sicheritz-Ponten</surname>
<given-names>T</given-names>
</name>
<name><surname>Lund</surname>
<given-names>O</given-names>
</name>
<name><surname>Svendsen</surname>
<given-names>CA</given-names>
</name>
<name><surname>Frimodt-Moller</surname>
<given-names>N</given-names>
</name>
<name><surname>Aarestrup</surname>
<given-names>FM</given-names>
</name>
</person-group>
<article-title>Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples</article-title>
<source>Journal of Clinical Microbiology</source>
<year>2014</year>
<volume>52</volume>
<issue>1</issue>
<fpage>139</fpage>
<lpage>146</lpage>
<pub-id pub-id-type="doi">10.1128/jcm.02452-13</pub-id>
<pub-id pub-id-type="pmid">24172157</pub-id>
</element-citation>
</ref>
<ref id="ref-5"><label>Inouye et al. (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Inouye</surname>
<given-names>M</given-names>
</name>
<name><surname>Dashnow</surname>
<given-names>H</given-names>
</name>
<name><surname>Raven</surname>
<given-names>L-A</given-names>
</name>
<name><surname>Schultz</surname>
<given-names>MB</given-names>
</name>
<name><surname>Pope</surname>
<given-names>BJ</given-names>
</name>
<name><surname>Tomita</surname>
<given-names>T</given-names>
</name>
<name><surname>Zobel</surname>
<given-names>J</given-names>
</name>
<name><surname>Holt</surname>
<given-names>KE</given-names>
</name>
</person-group>
<article-title>SRST2: Rapid genomic surveillance for public health and hospital microbiology labs</article-title>
<source>Genome Medicine</source>
<year>2014</year>
<volume>6</volume>
<issue>11</issue>
<fpage>90</fpage>
<pub-id pub-id-type="doi">10.1186/s13073-014-0090-6</pub-id>
<pub-id pub-id-type="pmid">25422674</pub-id>
</element-citation>
</ref>
<ref id="ref-6"><label>Kaplinski, Lepamets & Remm (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kaplinski</surname>
<given-names>L</given-names>
</name>
<name><surname>Lepamets</surname>
<given-names>M</given-names>
</name>
<name><surname>Remm</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>GenomeTester4: a toolkit for performing basic set operations—union, intersection and complement on <italic>k</italic>
-mer lists</article-title>
<source>Gigascience</source>
<year>2015</year>
<volume>4</volume>
<issue>1</issue>
<fpage>58</fpage>
<pub-id pub-id-type="doi">10.1186/s13742-015-0097-y</pub-id>
<pub-id pub-id-type="pmid">26640690</pub-id>
</element-citation>
</ref>
<ref id="ref-7"><label>Karamonová et al. (2013)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Karamonová</surname>
<given-names>L</given-names>
</name>
<name><surname>Junková</surname>
<given-names>P</given-names>
</name>
<name><surname>Mihalová</surname>
<given-names>D</given-names>
</name>
<name><surname>Javůrková</surname>
<given-names>B</given-names>
</name>
<name><surname>Fukal</surname>
<given-names>L</given-names>
</name>
<name><surname>Rauch</surname>
<given-names>P</given-names>
</name>
<name><surname>Blažková</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The potential of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry for the identification of biogroups of <italic>Cronobacter sakazakii</italic>
</article-title>
<source>Rapid Communications in Mass Spectrometry</source>
<year>2013</year>
<volume>27</volume>
<issue>3</issue>
<fpage>409</fpage>
<lpage>418</lpage>
<pub-id pub-id-type="doi">10.1002/rcm.6464</pub-id>
<pub-id pub-id-type="pmid">23280972</pub-id>
</element-citation>
</ref>
<ref id="ref-8"><label>Katoh et al. (2002)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Katoh</surname>
<given-names>K</given-names>
</name>
<name><surname>Misawa</surname>
<given-names>K</given-names>
</name>
<name><surname>Kuma</surname>
<given-names>K</given-names>
</name>
<name><surname>Miyata</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform</article-title>
<source>Nucleic Acids Research</source>
<year>2002</year>
<volume>30</volume>
<issue>14</issue>
<fpage>3059</fpage>
<lpage>3066</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkf436</pub-id>
<pub-id pub-id-type="pmid">12136088</pub-id>
</element-citation>
</ref>
<ref id="ref-9"><label>Lan &amp; Reeves (2002)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lan</surname>
<given-names>R</given-names>
</name>
<name><surname>Reeves</surname>
<given-names>PR</given-names>
</name>
</person-group>
<article-title><italic>Escherichia coli</italic>
 in disguise: molecular origins of <italic>Shigella</italic>
</article-title>
<source>Microbes and Infection</source>
<year>2002</year>
<volume>4</volume>
<issue>11</issue>
<fpage>1125</fpage>
<lpage>1132</lpage>
<pub-id pub-id-type="doi">10.1016/s1286-4579(02)01637-4</pub-id>
<pub-id pub-id-type="pmid">12361912</pub-id>
</element-citation>
</ref>
<ref id="ref-10"><label>Larsen et al. (2012)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Larsen</surname>
<given-names>MV</given-names>
</name>
<name><surname>Cosentino</surname>
<given-names>S</given-names>
</name>
<name><surname>Rasmussen</surname>
<given-names>S</given-names>
</name>
<name><surname>Friis</surname>
<given-names>C</given-names>
</name>
<name><surname>Hasman</surname>
<given-names>H</given-names>
</name>
<name><surname>Marvig</surname>
<given-names>RL</given-names>
</name>
<name><surname>Jelsbak</surname>
<given-names>L</given-names>
</name>
<name><surname>Sicheritz-Pontén</surname>
<given-names>T</given-names>
</name>
<name><surname>Ussery</surname>
<given-names>DW</given-names>
</name>
<name><surname>Aarestrup</surname>
<given-names>FM</given-names>
</name>
<name><surname>Lund</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>Multilocus sequence typing of total-genome-sequenced bacteria</article-title>
<source>Journal of Clinical Microbiology</source>
<year>2012</year>
<volume>50</volume>
<issue>4</issue>
<fpage>1355</fpage>
<lpage>1361</lpage>
<pub-id pub-id-type="doi">10.1128/jcm.06094-11</pub-id>
<pub-id pub-id-type="pmid">22238442</pub-id>
</element-citation>
</ref>
<ref id="ref-11"><label>Lindgreen, Adair &amp; Gardner (2016)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lindgreen</surname>
<given-names>S</given-names>
</name>
<name><surname>Adair</surname>
<given-names>KL</given-names>
</name>
<name><surname>Gardner</surname>
<given-names>PP</given-names>
</name>
</person-group>
<article-title>An evaluation of the accuracy and speed of metagenome analysis tools</article-title>
<source>Scientific Reports</source>
<year>2016</year>
<volume>6</volume>
<fpage>19233</fpage>
<pub-id pub-id-type="doi">10.1038/srep19233</pub-id>
<pub-id pub-id-type="pmid">26778510</pub-id>
</element-citation>
</ref>
<ref id="ref-12"><label>Maiden (2006)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Maiden</surname>
<given-names>MCJ</given-names>
</name>
</person-group>
<article-title>Multilocus sequence typing of bacteria</article-title>
<source>Annual Review of Microbiology</source>
<year>2006</year>
<volume>60</volume>
<fpage>561</fpage>
<lpage>588</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.micro.59.030804.121325</pub-id>
</element-citation>
</ref>
<ref id="ref-13"><label>Ogura et al. (2009)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ogura</surname>
<given-names>Y</given-names>
</name>
<name><surname>Ooka</surname>
<given-names>T</given-names>
</name>
<name><surname>Iguchi</surname>
<given-names>A</given-names>
</name>
<name><surname>Toh</surname>
<given-names>H</given-names>
</name>
<name><surname>Asadulghani</surname>
<given-names>M</given-names>
</name>
<name><surname>Oshima</surname>
<given-names>K</given-names>
</name>
<name><surname>Kodama</surname>
<given-names>T</given-names>
</name>
<name><surname>Abe</surname>
<given-names>H</given-names>
</name>
<name><surname>Nakayama</surname>
<given-names>K</given-names>
</name>
<name><surname>Kurokawa</surname>
<given-names>K</given-names>
</name>
<name><surname>Tobe</surname>
<given-names>T</given-names>
</name>
<name><surname>Hattori</surname>
<given-names>M</given-names>
</name>
<name><surname>Hayashi</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Comparative genomics reveal the mechanism of the parallel evolution of O157 and non-O157 enterohemorrhagic <italic>Escherichia coli</italic>
</article-title>
<source>Proceedings of the National Academy of Sciences of the United States of America</source>
<year>2009</year>
<volume>106</volume>
<issue>42</issue>
<fpage>17939</fpage>
<lpage>17944</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0903585106</pub-id>
</element-citation>
</ref>
<ref id="ref-14"><label>Ondov et al. (2016)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ondov</surname>
<given-names>BD</given-names>
</name>
<name><surname>Treangen</surname>
<given-names>TJ</given-names>
</name>
<name><surname>Mallonee</surname>
<given-names>AB</given-names>
</name>
<name><surname>Bergman</surname>
<given-names>NH</given-names>
</name>
<name><surname>Koren</surname>
<given-names>S</given-names>
</name>
<name><surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>Fast genome and metagenome distance estimation using MinHash</article-title>
<source>Genome Biology</source>
<year>2016</year>
<volume>17</volume>
<fpage>132</fpage>
<pub-id pub-id-type="pmid">27323842</pub-id>
</element-citation>
</ref>
<ref id="ref-15"><label>Ounit et al. (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ounit</surname>
<given-names>R</given-names>
</name>
<name><surname>Wanamaker</surname>
<given-names>S</given-names>
</name>
<name><surname>Close</surname>
<given-names>TJ</given-names>
</name>
<name><surname>Lonardi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative <italic>k</italic>
-mers</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>236</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-1419-2</pub-id>
<pub-id pub-id-type="pmid">25879410</pub-id>
</element-citation>
</ref>
<ref id="ref-16"><label>Peabody et al. (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Peabody</surname>
<given-names>MA</given-names>
</name>
<name><surname>Van Rossum</surname>
<given-names>T</given-names>
</name>
<name><surname>Lo</surname>
<given-names>R</given-names>
</name>
<name><surname>Brinkman</surname>
<given-names>FSL</given-names>
</name>
</person-group>
<article-title>Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities</article-title>
<source>BMC Bioinformatics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>363</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-015-0788-5</pub-id>
</element-citation>
</ref>
<ref id="ref-17"><label>Petty et al. (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Petty</surname>
<given-names>NK</given-names>
</name>
<name><surname>Ben Zakour</surname>
<given-names>NL</given-names>
</name>
<name><surname>Stanton-Cook</surname>
<given-names>M</given-names>
</name>
<name><surname>Skippington</surname>
<given-names>E</given-names>
</name>
<name><surname>Totsika</surname>
<given-names>M</given-names>
</name>
<name><surname>Forde</surname>
<given-names>BM</given-names>
</name>
<name><surname>Phan</surname>
<given-names>M-D</given-names>
</name>
<name><surname>Gomes Moriel</surname>
<given-names>D</given-names>
</name>
<name><surname>Peters</surname>
<given-names>KM</given-names>
</name>
<name><surname>Davies</surname>
<given-names>M</given-names>
</name>
<name><surname>Rogers</surname>
<given-names>BA</given-names>
</name>
<name><surname>Dougan</surname>
<given-names>G</given-names>
</name>
<name><surname>Rodriguez-Baño</surname>
<given-names>J</given-names>
</name>
<name><surname>Pascual</surname>
<given-names>A</given-names>
</name>
<name><surname>Pitout</surname>
<given-names>JDD</given-names>
</name>
<name><surname>Upton</surname>
<given-names>M</given-names>
</name>
<name><surname>Paterson</surname>
<given-names>DL</given-names>
</name>
<name><surname>Walsh</surname>
<given-names>TR</given-names>
</name>
<name><surname>Schembri</surname>
<given-names>MA</given-names>
</name>
<name><surname>Beatson</surname>
<given-names>SA</given-names>
</name>
</person-group>
<article-title>Global dissemination of a multidrug resistant <italic>Escherichia coli</italic>
 clone</article-title>
<source>Proceedings of the National Academy of Sciences of the United States of America</source>
<year>2014</year>
<volume>111</volume>
<issue>15</issue>
<fpage>5694</fpage>
<lpage>5699</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.1322678111</pub-id>
<pub-id pub-id-type="pmid">24706808</pub-id>
</element-citation>
</ref>
<ref id="ref-18"><label>Saputra et al. (2015)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Saputra</surname>
<given-names>D</given-names>
</name>
<name><surname>Rasmussen</surname>
<given-names>S</given-names>
</name>
<name><surname>Larsen</surname>
<given-names>MV</given-names>
</name>
<name><surname>Haddad</surname>
<given-names>N</given-names>
</name>
<name><surname>Sperotto</surname>
<given-names>MM</given-names>
</name>
<name><surname>Aarestrup</surname>
<given-names>FM</given-names>
</name>
<name><surname>Lund</surname>
<given-names>O</given-names>
</name>
<name><surname>Sicheritz-Pontén</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Reads2Type: a web application for rapid microbial taxonomy identification</article-title>
<source>BMC Bioinformatics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>398</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-015-0829-0</pub-id>
<pub-id pub-id-type="pmid">26608174</pub-id>
</element-citation>
</ref>
<ref id="ref-19"><label>Steiner et al. (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Steiner</surname>
<given-names>A</given-names>
</name>
<name><surname>Stucki</surname>
<given-names>D</given-names>
</name>
<name><surname>Coscolla</surname>
<given-names>M</given-names>
</name>
<name><surname>Borrell</surname>
<given-names>S</given-names>
</name>
<name><surname>Gagneux</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes</article-title>
<source>BMC Genomics</source>
<year>2014</year>
<volume>15</volume>
<issue>1</issue>
<fpage>881</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-15-881</pub-id>
<pub-id pub-id-type="pmid">25297886</pub-id>
</element-citation>
</ref>
<ref id="ref-20"><label>Tamura et al. (2013)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tamura</surname>
<given-names>K</given-names>
</name>
<name><surname>Stecher</surname>
<given-names>G</given-names>
</name>
<name><surname>Peterson</surname>
<given-names>D</given-names>
</name>
<name><surname>Filipski</surname>
<given-names>A</given-names>
</name>
<name><surname>Kumar</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>MEGA6: Molecular Evolutionary Genetics Analysis version 6.0</article-title>
<source>Molecular Biology and Evolution</source>
<year>2013</year>
<volume>30</volume>
<issue>12</issue>
<fpage>2725</fpage>
<lpage>2729</lpage>
<pub-id pub-id-type="doi">10.1093/molbev/mst197</pub-id>
<pub-id pub-id-type="pmid">24132122</pub-id>
</element-citation>
</ref>
<ref id="ref-21"><label>Tu, He &amp; Zhou (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tu</surname>
<given-names>Q</given-names>
</name>
<name><surname>He</surname>
<given-names>Z</given-names>
</name>
<name><surname>Zhou</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Strain/species identification in metagenomes using genome-specific markers</article-title>
<source>Nucleic Acids Research</source>
<year>2014</year>
<volume>42</volume>
<issue>8</issue>
<fpage>1</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku138</pub-id>
<pub-id pub-id-type="pmid">24376271</pub-id>
</element-citation>
</ref>
<ref id="ref-22"><label>Wood &amp; Salzberg (2014)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wood</surname>
<given-names>DE</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Kraken: ultrafast metagenomic sequence classification using exact alignments</article-title>
<source>Genome Biology</source>
<year>2014</year>
<volume>15</volume>
<issue>3</issue>
<fpage>R46</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2014-15-3-r46</pub-id>
<pub-id pub-id-type="pmid">24580807</pub-id>
</element-citation>
</ref>
<ref id="ref-23"><label>Zerbino &amp; Birney (2008)</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name><surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
<source>Genome Research</source>
<year>2008</year>
<volume>18</volume>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001118 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001118 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5438578
   |texte=   StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:28533988" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
