A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria

Internal identifier: 000F87 (Pmc/Corpus); previous: 000F86; next: 000F88

Authors: Erki Aun; Age Brauer; Veljo Kisand; Tanel Tenson; Maido Remm

Source:

PLoS Computational Biology [1553-734X]; 2018.

RBID : PMC:6211763

Abstract

We have developed an easy-to-use and memory-efficient method called PhenotypeSeeker that (a) identifies phenotype-specific k-mers, (b) generates a k-mer-based statistical model for predicting a given phenotype and (c) predicts the phenotype from the sequencing data of a given bacterial isolate. The method was validated on 167 Klebsiella pneumoniae isolates (virulence), 200 Pseudomonas aeruginosa isolates (ciprofloxacin resistance) and 459 Clostridium difficile isolates (azithromycin resistance). The phenotype prediction models trained on these datasets achieved F1-measures of 0.88 on the K. pneumoniae test set, 0.88 on the P. aeruginosa test set and 0.97 on the C. difficile test set. The F1-measures were the same for assembled sequences and raw sequencing data; however, building the model from assembled genomes is significantly faster. On these datasets, model building on a mid-range Linux server takes approximately 3 to 5 hours per phenotype if assembled genomes are used and 10 hours per phenotype if raw sequencing data are used. Phenotype prediction from assembled genomes takes less than one second per isolate. Thus, PhenotypeSeeker should be well-suited for predicting phenotypes from large sequencing datasets. PhenotypeSeeker is implemented in the Python programming language, is open-source software and is available on GitHub (https://github.com/bioinfo-ut/PhenotypeSeeker/).

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6211763

DOI: 10.1371/journal.pcbi.1006434
PubMed: 30346947
PubMed Central: 6211763

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A <italic>k</italic>
-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria</title>
<author><name sortKey="Aun, Erki" sort="Aun, Erki" uniqKey="Aun E" first="Erki" last="Aun">Erki Aun</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Brauer, Age" sort="Brauer, Age" uniqKey="Brauer A" first="Age" last="Brauer">Age Brauer</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kisand, Veljo" sort="Kisand, Veljo" uniqKey="Kisand V" first="Veljo" last="Kisand">Veljo Kisand</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Technology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tenson, Tanel" sort="Tenson, Tanel" uniqKey="Tenson T" first="Tanel" last="Tenson">Tanel Tenson</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Technology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remm, Maido" sort="Remm, Maido" uniqKey="Remm M" first="Maido" last="Remm">Maido Remm</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">30346947</idno>
<idno type="pmc">6211763</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6211763</idno>
<idno type="RBID">PMC:6211763</idno>
<idno type="doi">10.1371/journal.pcbi.1006434</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000F87</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F87</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A <italic>k</italic>
-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria</title>
<author><name sortKey="Aun, Erki" sort="Aun, Erki" uniqKey="Aun E" first="Erki" last="Aun">Erki Aun</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Brauer, Age" sort="Brauer, Age" uniqKey="Brauer A" first="Age" last="Brauer">Age Brauer</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kisand, Veljo" sort="Kisand, Veljo" uniqKey="Kisand V" first="Veljo" last="Kisand">Veljo Kisand</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Technology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tenson, Tanel" sort="Tenson, Tanel" uniqKey="Tenson T" first="Tanel" last="Tenson">Tanel Tenson</name>
<affiliation><nlm:aff id="aff002"><addr-line>Institute of Technology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remm, Maido" sort="Remm, Maido" uniqKey="Remm M" first="Maido" last="Remm">Maido Remm</name>
<affiliation><nlm:aff id="aff001"><addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint><date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>We have developed an easy-to-use and memory-efficient method called PhenotypeSeeker that (a) identifies phenotype-specific <italic>k</italic>-mers, (b) generates a <italic>k</italic>
-mer-based statistical model for predicting a given phenotype and (c) predicts the phenotype from the sequencing data of a given bacterial isolate. The method was validated on 167 <italic>Klebsiella pneumoniae</italic>
 isolates (virulence), 200 <italic>Pseudomonas aeruginosa</italic>
 isolates (ciprofloxacin resistance) and 459 <italic>Clostridium difficile</italic>
 isolates (azithromycin resistance). The phenotype prediction models trained on these datasets achieved F1-measures of 0.88 on the <italic>K</italic>
. <italic>pneumoniae</italic>
 test set, 0.88 on the <italic>P</italic>
. <italic>aeruginosa</italic>
 test set and 0.97 on the <italic>C</italic>
. <italic>difficile</italic>
 test set. The F1-measures were the same for assembled sequences and raw sequencing data; however, building the model from assembled genomes is significantly faster. On these datasets, model building on a mid-range Linux server takes approximately 3 to 5 hours per phenotype if assembled genomes are used and 10 hours per phenotype if raw sequencing data are used. Phenotype prediction from assembled genomes takes less than one second per isolate. Thus, PhenotypeSeeker should be well-suited for predicting phenotypes from large sequencing datasets. PhenotypeSeeker is implemented in the Python programming language, is open-source software and is available on GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/bioinfo-ut/PhenotypeSeeker/">https://github.com/bioinfo-ut/PhenotypeSeeker/</ext-link>
).</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Kisand, V" uniqKey="Kisand V">V Kisand</name>
</author>
<author><name sortKey="Lettieri, T" uniqKey="Lettieri T">T Lettieri</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Crofts, Ts" uniqKey="Crofts T">TS Crofts</name>
</author>
<author><name sortKey="Gasparrini, Aj" uniqKey="Gasparrini A">AJ Gasparrini</name>
</author>
<author><name sortKey="Dantas, G" uniqKey="Dantas G">G Dantas</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bakour, S" uniqKey="Bakour S">S Bakour</name>
</author>
<author><name sortKey="Sankar, Sa" uniqKey="Sankar S">SA Sankar</name>
</author>
<author><name sortKey="Rathored, J" uniqKey="Rathored J">J Rathored</name>
</author>
<author><name sortKey="Biagini, P" uniqKey="Biagini P">P Biagini</name>
</author>
<author><name sortKey="Raoult, D" uniqKey="Raoult D">D Raoult</name>
</author>
<author><name sortKey="Fournier, P E" uniqKey="Fournier P">P-E Fournier</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wheeler, Ne" uniqKey="Wheeler N">NE Wheeler</name>
</author>
<author><name sortKey="Gardner, Pp" uniqKey="Gardner P">PP Gardner</name>
</author>
<author><name sortKey="Barquist, L" uniqKey="Barquist L">L Barquist</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author><name sortKey="Metcalf, Bj" uniqKey="Metcalf B">BJ Metcalf</name>
</author>
<author><name sortKey="Chochua, S" uniqKey="Chochua S">S Chochua</name>
</author>
<author><name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author><name sortKey="Gertz, Re" uniqKey="Gertz R">RE Gertz</name>
</author>
<author><name sortKey="Walker, H" uniqKey="Walker H">H Walker</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lees, Ja" uniqKey="Lees J">JA Lees</name>
</author>
<author><name sortKey="Vehkala, M" uniqKey="Vehkala M">M Vehkala</name>
</author>
<author><name sortKey="Valimaki, N" uniqKey="Valimaki N">N Välimäki</name>
</author>
<author><name sortKey="Harris, Sr" uniqKey="Harris S">SR Harris</name>
</author>
<author><name sortKey="Chewapreecha, C" uniqKey="Chewapreecha C">C Chewapreecha</name>
</author>
<author><name sortKey="Croucher, Nj" uniqKey="Croucher N">NJ Croucher</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nguyen, M" uniqKey="Nguyen M">M Nguyen</name>
</author>
<author><name sortKey="Brettin, T" uniqKey="Brettin T">T Brettin</name>
</author>
<author><name sortKey="Long, Sw" uniqKey="Long S">SW Long</name>
</author>
<author><name sortKey="Musser, Jm" uniqKey="Musser J">JM Musser</name>
</author>
<author><name sortKey="Olsen, Rj" uniqKey="Olsen R">RJ Olsen</name>
</author>
<author><name sortKey="Olson, R" uniqKey="Olson R">R Olson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Davis, Jj" uniqKey="Davis J">JJ Davis</name>
</author>
<author><name sortKey="Boisvert, S" uniqKey="Boisvert S">S Boisvert</name>
</author>
<author><name sortKey="Brettin, T" uniqKey="Brettin T">T Brettin</name>
</author>
<author><name sortKey="Kenyon, Rw" uniqKey="Kenyon R">RW Kenyon</name>
</author>
<author><name sortKey="Mao, C" uniqKey="Mao C">C Mao</name>
</author>
<author><name sortKey="Olson, R" uniqKey="Olson R">R Olson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Drouin, A" uniqKey="Drouin A">A Drouin</name>
</author>
<author><name sortKey="Giguere, S" uniqKey="Giguere S">S Giguère</name>
</author>
<author><name sortKey="Deraspe, M" uniqKey="Deraspe M">M Déraspe</name>
</author>
<author><name sortKey="Marchand, M" uniqKey="Marchand M">M Marchand</name>
</author>
<author><name sortKey="Tyers, M" uniqKey="Tyers M">M Tyers</name>
</author>
<author><name sortKey="Loo, Vg" uniqKey="Loo V">VG Loo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marinier, E" uniqKey="Marinier E">E Marinier</name>
</author>
<author><name sortKey="Zaheer, R" uniqKey="Zaheer R">R Zaheer</name>
</author>
<author><name sortKey="Berry, C" uniqKey="Berry C">C Berry</name>
</author>
<author><name sortKey="Weedmark, Ka" uniqKey="Weedmark K">KA Weedmark</name>
</author>
<author><name sortKey="Domaratzki, M" uniqKey="Domaratzki M">M Domaratzki</name>
</author>
<author><name sortKey="Mabon, P" uniqKey="Mabon P">P Mabon</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kaplinski, L" uniqKey="Kaplinski L">L Kaplinski</name>
</author>
<author><name sortKey="Lepamets, M" uniqKey="Lepamets M">M Lepamets</name>
</author>
<author><name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ondov, Bd" uniqKey="Ondov B">BD Ondov</name>
</author>
<author><name sortKey="Treangen, Tj" uniqKey="Treangen T">TJ Treangen</name>
</author>
<author><name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author><name sortKey="Mallonee, Ab" uniqKey="Mallonee A">AB Mallonee</name>
</author>
<author><name sortKey="Bergman, Nh" uniqKey="Bergman N">NH Bergman</name>
</author>
<author><name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gerstein, M" uniqKey="Gerstein M">M Gerstein</name>
</author>
<author><name sortKey="Sonnhammer, El" uniqKey="Sonnhammer E">EL Sonnhammer</name>
</author>
<author><name sortKey="Chothia, C" uniqKey="Chothia C">C Chothia</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pajuste, F D" uniqKey="Pajuste F">F-D Pajuste</name>
</author>
<author><name sortKey="Kaplinski, L" uniqKey="Kaplinski L">L Kaplinski</name>
</author>
<author><name sortKey="Mols, M" uniqKey="Mols M">M Möls</name>
</author>
<author><name sortKey="Puurand, T" uniqKey="Puurand T">T Puurand</name>
</author>
<author><name sortKey="Lepamets, M" uniqKey="Lepamets M">M Lepamets</name>
</author>
<author><name sortKey="Remm, M" uniqKey="Remm M">M Remm</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Barker, Kf" uniqKey="Barker K">KF Barker</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Susceptibilitytesting Ec On, A" uniqKey="Susceptibilitytesting Ec On A">A SusceptibilityTesting EC on</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fabrega, A" uniqKey="Fabrega A">A Fàbrega</name>
</author>
<author><name sortKey="Madurga, S" uniqKey="Madurga S">S Madurga</name>
</author>
<author><name sortKey="Giralt, E" uniqKey="Giralt E">E Giralt</name>
</author>
<author><name sortKey="Vila, J" uniqKey="Vila J">J Vila</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jalal, S" uniqKey="Jalal S">S Jalal</name>
</author>
<author><name sortKey="Wretlind, B" uniqKey="Wretlind B">B Wretlind</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kaminska, Kh" uniqKey="Kaminska K">KH Kaminska</name>
</author>
<author><name sortKey="Purta, E" uniqKey="Purta E">E Purta</name>
</author>
<author><name sortKey="Hansen, Lh" uniqKey="Hansen L">LH Hansen</name>
</author>
<author><name sortKey="Bujnicki, Jm" uniqKey="Bujnicki J">JM Bujnicki</name>
</author>
<author><name sortKey="Vester, B" uniqKey="Vester B">B Vester</name>
</author>
<author><name sortKey="Long, Ks" uniqKey="Long K">KS Long</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Carniel, E" uniqKey="Carniel E">E Carniel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chen, Yt" uniqKey="Chen Y">YT Chen</name>
</author>
<author><name sortKey="Chang, Hy" uniqKey="Chang H">HY Chang</name>
</author>
<author><name sortKey="Lai, Yc" uniqKey="Lai Y">YC Lai</name>
</author>
<author><name sortKey="Pan, Cc" uniqKey="Pan C">CC Pan</name>
</author>
<author><name sortKey="Tsai, Sf" uniqKey="Tsai S">SF Tsai</name>
</author>
<author><name sortKey="Peng, Hl" uniqKey="Peng H">HL Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lagos, R" uniqKey="Lagos R">R Lagos</name>
</author>
<author><name sortKey="Baeza, M" uniqKey="Baeza M">M Baeza</name>
</author>
<author><name sortKey="Corsini, G" uniqKey="Corsini G">G Corsini</name>
</author>
<author><name sortKey="Hetz, C" uniqKey="Hetz C">C Hetz</name>
</author>
<author><name sortKey="Strahsburger, E" uniqKey="Strahsburger E">E Strahsburger</name>
</author>
<author><name sortKey="Castillo, Ja" uniqKey="Castillo J">JA Castillo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nassif, X" uniqKey="Nassif X">X Nassif</name>
</author>
<author><name sortKey="Sansonetti, Pj" uniqKey="Sansonetti P">PJ Sansonetti</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Putze, J" uniqKey="Putze J">J Putze</name>
</author>
<author><name sortKey="Hennequin, C" uniqKey="Hennequin C">C Hennequin</name>
</author>
<author><name sortKey="Nougayrede, Jp" uniqKey="Nougayrede J">JP Nougayrède</name>
</author>
<author><name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author><name sortKey="Homburg, S" uniqKey="Homburg S">S Homburg</name>
</author>
<author><name sortKey="Karch, H" uniqKey="Karch H">H Karch</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chou, Hc" uniqKey="Chou H">HC Chou</name>
</author>
<author><name sortKey="Lee, Cz" uniqKey="Lee C">CZ Lee</name>
</author>
<author><name sortKey="Ma, Lc" uniqKey="Ma L">LC Ma</name>
</author>
<author><name sortKey="Fang, Ct" uniqKey="Fang C">CT Fang</name>
</author>
<author><name sortKey="Chang, Sc" uniqKey="Chang S">SC Chang</name>
</author>
<author><name sortKey="Wang, Jt" uniqKey="Wang J">JT Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cheng, Hy" uniqKey="Cheng H">HY Cheng</name>
</author>
<author><name sortKey="Chen, Ys" uniqKey="Chen Y">YS Chen</name>
</author>
<author><name sortKey="Wu, Cy" uniqKey="Wu C">CY Wu</name>
</author>
<author><name sortKey="Chang, Hy" uniqKey="Chang H">HY Chang</name>
</author>
<author><name sortKey="Lai, Yc" uniqKey="Lai Y">YC Lai</name>
</author>
<author><name sortKey="Peng, Hl" uniqKey="Peng H">HL Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lai, Y" uniqKey="Lai Y">Y Lai</name>
</author>
<author><name sortKey="Peng, H" uniqKey="Peng H">H Peng</name>
</author>
<author><name sortKey="Chang, H" uniqKey="Chang H">H Chang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ma, L C" uniqKey="Ma L">L-C Ma</name>
</author>
<author><name sortKey="Fang, C T" uniqKey="Fang C">C-T Fang</name>
</author>
<author><name sortKey="Lee, C Z" uniqKey="Lee C">C-Z Lee</name>
</author>
<author><name sortKey="Shun, C T" uniqKey="Shun C">C-T Shun</name>
</author>
<author><name sortKey="Wang, J T" uniqKey="Wang J">J-T Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lai, Yc" uniqKey="Lai Y">YC Lai</name>
</author>
<author><name sortKey="Lin, G T" uniqKey="Lin G">G-T Lin</name>
</author>
<author><name sortKey="Yang, S L" uniqKey="Yang S">S-L Yang</name>
</author>
<author><name sortKey="Chang, H Y" uniqKey="Chang H">H-Y Chang</name>
</author>
<author><name sortKey="Peng, H L" uniqKey="Peng H">H-L Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bankevich, A" uniqKey="Bankevich A">A Bankevich</name>
</author>
<author><name sortKey="Nurk, S" uniqKey="Nurk S">S Nurk</name>
</author>
<author><name sortKey="Antipov, D" uniqKey="Antipov D">D Antipov</name>
</author>
<author><name sortKey="Gurevich, Aa" uniqKey="Gurevich A">AA Gurevich</name>
</author>
<author><name sortKey="Dvorkin, M" uniqKey="Dvorkin M">M Dvorkin</name>
</author>
<author><name sortKey="Kulikov, As" uniqKey="Kulikov A">AS Kulikov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Holt, Ke" uniqKey="Holt K">KE Holt</name>
</author>
<author><name sortKey="Wertheim, H" uniqKey="Wertheim H">H Wertheim</name>
</author>
<author><name sortKey="Zadoks, Rn" uniqKey="Zadoks R">RN Zadoks</name>
</author>
<author><name sortKey="Baker, S" uniqKey="Baker S">S Baker</name>
</author>
<author><name sortKey="Whitehouse, Ca" uniqKey="Whitehouse C">CA Whitehouse</name>
</author>
<author><name sortKey="Dance, D" uniqKey="Dance D">D Dance</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
<author><name sortKey="Maxwell, P" uniqKey="Maxwell P">P Maxwell</name>
</author>
<author><name sortKey="Birmingham, A" uniqKey="Birmingham A">A Birmingham</name>
</author>
<author><name sortKey="Carnes, J" uniqKey="Carnes J">J Carnes</name>
</author>
<author><name sortKey="Caporaso, Jg" uniqKey="Caporaso J">JG Caporaso</name>
</author>
<author><name sortKey="Easton, Bc" uniqKey="Easton B">BC Easton</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Josh Pasek, A" uniqKey="Josh Pasek A">A Josh Pasek</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pedregosa, F" uniqKey="Pedregosa F">F Pedregosa</name>
</author>
<author><name sortKey="Varoquaux, G" uniqKey="Varoquaux G">G Varoquaux</name>
</author>
<author><name sortKey="Gramfort, A" uniqKey="Gramfort A">A Gramfort</name>
</author>
<author><name sortKey="Michel, V" uniqKey="Michel V">V Michel</name>
</author>
<author><name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author><name sortKey="Grisel, O" uniqKey="Grisel O">O Grisel</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group><journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">30346947</article-id>
<article-id pub-id-type="pmc">6211763</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.1006434</article-id>
<article-id pub-id-type="publisher-id">PCOMPBIOL-D-18-00544</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Microbiology</subject>
<subj-group><subject>Medical Microbiology</subject>
<subj-group><subject>Microbial Pathogens</subject>
<subj-group><subject>Bacterial Pathogens</subject>
<subj-group><subject>Pseudomonas Aeruginosa</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Medicine and Health Sciences</subject>
<subj-group><subject>Pathology and Laboratory Medicine</subject>
<subj-group><subject>Pathogens</subject>
<subj-group><subject>Microbial Pathogens</subject>
<subj-group><subject>Bacterial Pathogens</subject>
<subj-group><subject>Pseudomonas Aeruginosa</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Organisms</subject>
<subj-group><subject>Bacteria</subject>
<subj-group><subject>Pseudomonas</subject>
<subj-group><subject>Pseudomonas Aeruginosa</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Research and Analysis Methods</subject>
<subj-group><subject>Mathematical and Statistical Techniques</subject>
<subj-group><subject>Statistical Methods</subject>
<subj-group><subject>Forecasting</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Physical Sciences</subject>
<subj-group><subject>Mathematics</subject>
<subj-group><subject>Statistics</subject>
<subj-group><subject>Statistical Methods</subject>
<subj-group><subject>Forecasting</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Microbiology</subject>
<subj-group><subject>Microbial Control</subject>
<subj-group><subject>Antimicrobial Resistance</subject>
<subj-group><subject>Antibiotic Resistance</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Medicine and Health Sciences</subject>
<subj-group><subject>Pharmacology</subject>
<subj-group><subject>Antimicrobial Resistance</subject>
<subj-group><subject>Antibiotic Resistance</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Organisms</subject>
<subj-group><subject>Bacteria</subject>
<subj-group><subject>Gut Bacteria</subject>
<subj-group><subject>Clostridium Difficile</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Computational Biology</subject>
<subj-group><subject>Genome Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Genetics</subject>
<subj-group><subject>Genomics</subject>
<subj-group><subject>Genome Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Organisms</subject>
<subj-group><subject>Bacteria</subject>
<subj-group><subject>Klebsiella</subject>
<subj-group><subject>Klebsiella Pneumoniae</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Microbiology</subject>
<subj-group><subject>Medical Microbiology</subject>
<subj-group><subject>Microbial Pathogens</subject>
<subj-group><subject>Bacterial Pathogens</subject>
<subj-group><subject>Klebsiella</subject>
<subj-group><subject>Klebsiella Pneumoniae</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Medicine and Health Sciences</subject>
<subj-group><subject>Pathology and Laboratory Medicine</subject>
<subj-group><subject>Pathogens</subject>
<subj-group><subject>Microbial Pathogens</subject>
<subj-group><subject>Bacterial Pathogens</subject>
<subj-group><subject>Klebsiella</subject>
<subj-group><subject>Klebsiella Pneumoniae</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Molecular Biology</subject>
<subj-group><subject>Molecular Biology Techniques</subject>
<subj-group><subject>Sequencing Techniques</subject>
<subj-group><subject>Genome Sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Research and Analysis Methods</subject>
<subj-group><subject>Molecular Biology Techniques</subject>
<subj-group><subject>Sequencing Techniques</subject>
<subj-group><subject>Genome Sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3"><subject>Biology and Life Sciences</subject>
<subj-group><subject>Genetics</subject>
<subj-group><subject>Gene Identification and Analysis</subject>
<subj-group><subject>Mutation Detection</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>A <italic>k</italic>
-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria</article-title>
<alt-title alt-title-type="running-head">A method to identify phenotype-specific <italic>k</italic>
-mers and predict phenotypes</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author"><contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0001-7446-3524</contrib-id>
<name><surname>Aun</surname>
<given-names>Erki</given-names>
</name>
<role content-type="http://credit.casrai.org/">Software</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff001"><sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Brauer</surname>
<given-names>Age</given-names>
</name>
<role content-type="http://credit.casrai.org/">Investigation</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff001"><sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Kisand</surname>
<given-names>Veljo</given-names>
</name>
<role content-type="http://credit.casrai.org/">Data curation</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff002"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Tenson</surname>
<given-names>Tanel</given-names>
</name>
<role content-type="http://credit.casrai.org/">Data curation</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff002"><sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author"><contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-3966-8422</contrib-id>
<name><surname>Remm</surname>
<given-names>Maido</given-names>
</name>
<role content-type="http://credit.casrai.org/">Conceptualization</role>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Project administration</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff001"><sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff001"><label>1</label>
<addr-line>Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia</addr-line>
</aff>
<aff id="aff002"><label>2</label>
<addr-line>Institute of Technology, University of Tartu, Tartu, Estonia</addr-line>
</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Ouzounis</surname>
<given-names>Christos A.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1"><addr-line>CPERI, GREECE</addr-line>
</aff>
<author-notes><fn fn-type="COI-statement" id="coi001"><p>The authors have declared that no competing interests exist.</p>
</fn>
<corresp id="cor001">* E-mail: <email>erki.aun@ut.ee</email>
</corresp>
</author-notes>
<pub-date pub-type="epub"><day>22</day>
<month>10</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection"><month>10</month>
<year>2018</year>
</pub-date>
<volume>14</volume>
<issue>10</issue>
<elocation-id>e1006434</elocation-id>
<history><date date-type="received"><day>9</day>
<month>4</month>
<year>2018</year>
</date>
<date date-type="accepted"><day>15</day>
<month>8</month>
<year>2018</year>
</date>
</history>
<permissions><copyright-statement>© 2018 Aun et al</copyright-statement>
<copyright-year>2018</copyright-year>
<copyright-holder>Aun et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pcbi.1006434.pdf"></self-uri>
<abstract><p>We have developed an easy-to-use and memory-efficient method called PhenotypeSeeker that (a) identifies phenotype-specific k-mers, (b) generates a <italic>k</italic>
-mer-based statistical model for predicting a given phenotype and (c) predicts the phenotype from the sequencing data of a given bacterial isolate. The method was validated on 167 <italic>Klebsiella pneumoniae</italic>
 isolates (virulence), 200 <italic>Pseudomonas aeruginosa</italic>
 isolates (ciprofloxacin resistance) and 459 <italic>Clostridium difficile</italic>
 isolates (azithromycin resistance). The phenotype prediction models trained from these datasets obtained the F1-measure of 0.88 on the <italic>K</italic>
. <italic>pneumoniae</italic>
 test set, 0.88 on the <italic>P</italic>
. <italic>aeruginosa</italic>
 test set and 0.97 on the <italic>C</italic>
. <italic>difficile</italic>
 test set. The F1-measures were the same for assembled sequences and raw sequencing data; however, building the model from assembled genomes is significantly faster. On these datasets, the model building on a mid-range Linux server takes approximately 3 to 5 hours per phenotype if assembled genomes are used and 10 hours per phenotype if raw sequencing data are used. The phenotype prediction from assembled genomes takes less than one second per isolate. Thus, PhenotypeSeeker should be well-suited for predicting phenotypes from large sequencing datasets. PhenotypeSeeker is implemented in the Python programming language, is open-source software and is available on GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/bioinfo-ut/PhenotypeSeeker/">https://github.com/bioinfo-ut/PhenotypeSeeker/</ext-link>
).</p>
</abstract>
<abstract abstract-type="summary"><title>Author summary</title>
<p>Predicting phenotypic properties of bacterial isolates from their genomic sequences has numerous potential applications. A good example would be prediction of antimicrobial resistance and virulence phenotypes for use in medical diagnostics. We have developed a method that is able to predict phenotypes of interest from the genomic sequence of the isolate within seconds. The method uses a statistical model that can be trained automatically on isolates with known phenotypes. The method is implemented in the Python programming language and can be run on a low-end Linux server or a laptop computer.</p>
</abstract>
<funding-group><award-group id="award001"><funding-source><institution>Estonian Research Council</institution>
</funding-source>
<award-id>IUT2-22</award-id>
<principal-award-recipient><name><surname>Tenson</surname>
<given-names>Tanel</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award002"><funding-source><institution>Estonian Research Council</institution>
</funding-source>
<award-id>IUT34-11</award-id>
<principal-award-recipient><contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-3966-8422</contrib-id>
<name><surname>Remm</surname>
<given-names>Maido</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group id="award003"><funding-source><institution>EU ERDF</institution>
</funding-source>
<award-id>2014-2020.4.01.15-0012</award-id>
</award-group>
<award-group id="award004"><funding-source><institution>EU ERDF</institution>
</funding-source>
<award-id>2014-2020.4.01.15-0013</award-id>
</award-group>
<funding-statement>This work was funded by institutional grants IUT34-11 (MR) and IUT2-22 (TT) from Estonian Research Council (<ext-link ext-link-type="uri" xlink:href="http://www.etag.ee/en/estonian-research-council/">http://www.etag.ee/en/estonian-research-council/</ext-link>
) and by the grants No. 2014-2020.4.01.15-0012 (to Estonian Centre of Excellence in Genomics and Translational Medicine) and No. 2014-2020.4.01.15-0013 (to Estonian Centre of Excellence in Molecular Cell Engineering) from the European Regional Development Fund (<ext-link ext-link-type="uri" xlink:href="http://ec.europa.eu/regional_policy/en/funding/erdf/">http://ec.europa.eu/regional_policy/en/funding/erdf/</ext-link>
). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts><fig-count count="4"></fig-count>
<table-count count="2"></table-count>
<page-count count="17"></page-count>
</counts>
<custom-meta-group><custom-meta><meta-name>PLOS Publication Stage</meta-name>
<meta-value>vor-update-to-uncorrected-proof</meta-value>
</custom-meta>
<custom-meta><meta-name>Publication Update</meta-name>
<meta-value>2018-11-01</meta-value>
</custom-meta>
<custom-meta id="data-availability"><meta-name>Data Availability</meta-name>
<meta-value>PhenotypeSeeker software is available at GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/bioinfo-ut/PhenotypeSeeker">https://github.com/bioinfo-ut/PhenotypeSeeker</ext-link>
). C. difficile genomes used for software validation are available from the European Nucleotide Archive [EMBL:PRJEB11776 (<ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB11776">http://www.ebi.ac.uk/ena/data/view/PRJEB11776</ext-link>
)]. The binary phenotypes of azithromycin resistance for these C. difficile genomes are from Drouin et al. 2016 (Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genomics [Internet]. 17(1):754. Available from: <ext-link ext-link-type="uri" xlink:href="http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6">http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6</ext-link>
). K. pneumoniae genomes used for software validation are available from the European Nucleotide Archive [EMBL:PRJEB2111 (<ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/ena/data/view/PRJEB2111">https://www.ebi.ac.uk/ena/data/view/PRJEB2111</ext-link>
)]. The binary phenotypes of infection status (infection/carriage) for these K. pneumoniae genomes are from Holt et al. 2015 (Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health. Proc Natl Acad Sci [Internet]. 112(27):E3574–81. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112">http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112</ext-link>
). The P. aeruginosa dataset used for software validation is available from the NCBI's BioProject database [Accession: PRJNA244279 (<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279">https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279</ext-link>
)].</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes><title>Data Availability</title>
<p>PhenotypeSeeker software is available at GitHub (<ext-link ext-link-type="uri" xlink:href="https://github.com/bioinfo-ut/PhenotypeSeeker">https://github.com/bioinfo-ut/PhenotypeSeeker</ext-link>
). C. difficile genomes used for software validation are available from the European Nucleotide Archive [EMBL:PRJEB11776 (<ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB11776">http://www.ebi.ac.uk/ena/data/view/PRJEB11776</ext-link>
)]. The binary phenotypes of azithromycin resistance for these C. difficile genomes are from Drouin et al. 2016 (Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genomics [Internet]. 17(1):754. Available from: <ext-link ext-link-type="uri" xlink:href="http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6">http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6</ext-link>
). K. pneumoniae genomes used for software validation are available from the European Nucleotide Archive [EMBL:PRJEB2111 (<ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/ena/data/view/PRJEB2111">https://www.ebi.ac.uk/ena/data/view/PRJEB2111</ext-link>
)]. The binary phenotypes of infection status (infection/carriage) for these K. pneumoniae genomes are from Holt et al. 2015 (Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health. Proc Natl Acad Sci [Internet]. 112(27):E3574–81. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112">http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112</ext-link>
). The P. aeruginosa dataset used for software validation is available from the NCBI's BioProject database [Accession: PRJNA244279 (<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279">https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279</ext-link>
)].</p>
</notes>
</front>
<body><disp-quote><p>This is a <italic>PLOS Computational Biology</italic>
 Methods paper.</p>
</disp-quote>
<sec sec-type="intro" id="sec001"><title>Introduction</title>
<p>The falling cost of sequencing has made genome sequencing affordable to a large number of labs, and therefore, there has been a dramatic increase in the number of genome sequences available for comparison in the public domain [<xref rid="pcbi.1006434.ref001" ref-type="bibr">1</xref>
]. These developments have facilitated the genomic analysis of bacterial isolates. An increasing amount of bacterial whole genome sequencing (WGS) data has led to more and more genome-wide studies of DNA variation related to different phenotypes. Among these studies, antibiotic resistance phenotypes are the most concerning and have garnered high public interest, especially since several multidrug-resistant strains have emerged worldwide. The detection of known resistance-causing mutations as well as the search for new candidate biomarkers leading to resistance phenotypes requires reasonably rapid and easily applicable tools for processing and comparing the sequencing data of hundreds of isolated strains. However, there is still a lack of user-friendly software tools for the identification of genomic biomarkers from large sequencing datasets of bacterial isolates [<xref rid="pcbi.1006434.ref002" ref-type="bibr">2</xref>
,<xref rid="pcbi.1006434.ref003" ref-type="bibr">3</xref>
].</p>
<p>While microbial genome-wide association studies (GWAS) can be successfully used in case of previously known genotype-phenotype associations caused by a single gene or only a set of few and specific mutations, more complex and novel associations would remain undetected. In addition, many bacterial species have extensive intra-species variation from small sequence-based differences to the absence or presence of whole genes or gene clusters. Choosing only one genome as a reference for searching for the variable components would be highly limiting.</p>
<p>Alternative approaches use previously detected genomic features, either single nucleotide variations or longer sequences, behind the phenotype to create and train models using those features as the predictors. Not only antibiotic resistance but also a wide range of other phenotypes can be predicted, e.g., host adaptation in invasive serovars [<xref rid="pcbi.1006434.ref004" ref-type="bibr">4</xref>
], minimum inhibitory concentrations of antibiotics [<xref rid="pcbi.1006434.ref005" ref-type="bibr">5</xref>
] or virulence of the strains [<xref rid="pcbi.1006434.ref006" ref-type="bibr">6</xref>
]. Using longer sequence regions, such as full genes, in those models requires assembled genomes as input, which adds a data preprocessing step. One way to avoid this is to use <italic>k</italic>
-mers, which are short DNA oligomers with length <italic>k</italic>
, that enable us to simultaneously discover a large set of single nucleotide variations, insertions and deletions associated with the phenotypes under study. The advantage of using <italic>k</italic>
-mer-based methods in genomic biomarker discovery is that they do not require sequence alignments and can even be applied to raw sequencing data.</p>
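<p>The idea can be illustrated with a minimal sketch. The toy sequences and the helper name kmers below are invented for illustration and are not part of PhenotypeSeeker; the sketch only shows that a single substitution produces a small set of new overlapping k-mers that can be found without any alignment.</p>
<preformat>
```python
def kmers(sequence, k):
    """Return the list of all overlapping k-mers in a DNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy sequences (invented): the variant differs from the reference by one base.
reference = "ACGTTGCATGCA"
variant = "ACGTTGAATGCA"  # C -> A at 0-based position 6

k = 4
reference_kmers = set(kmers(reference, k))
variant_kmers = set(kmers(variant, k))

# k-mers seen only in the variant are candidate markers of the substitution.
variant_specific = sorted(variant_kmers - reference_kmers)
print(variant_specific)
```
</preformat>
<p>In this toy case the substitution yields exactly k variant-specific k-mers, one for each window that overlaps the changed base; the count can be lower when some of those windows also occur elsewhere in the reference.</p>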
<p>In recent years, several studies using different machine learning algorithms and <italic>k</italic>
-mers for detecting the biomarkers behind different bacterial phenotypes have been published. Among the latest, short <italic>k</italic>
-mers and machine learning (ML) have been used to create minimum inhibitory concentration prediction models in assembled <italic>Klebsiella pneumoniae</italic>
 genomes for several antibiotics [<xref rid="pcbi.1006434.ref007" ref-type="bibr">7</xref>
]. PATRIC and RAST annotation services include prediction of antimicrobial resistance with the species- and antibiotic-specific classifier <italic>k</italic>
-mers which are selected using publicly available and collected metadata and the adaptive boosting ML algorithms [<xref rid="pcbi.1006434.ref008" ref-type="bibr">8</xref>
].</p>
<p>Although these studies provide a framework or predictive models for a specific species with a certain phenotype, they have not focused on creating software that can be readily used by researchers who lack access to extensive computing resources but still need to analyze large-scale bacterial genome sequencing data in a reasonable amount of computing time. Only a few papers describe software that we were able to compare with PhenotypeSeeker.</p>
<p>The SEER program takes either a discrete or continuous phenotype as an input, counts variable-length <italic>k</italic>
-mers and corrects for the clonal population structure [<xref rid="pcbi.1006434.ref006" ref-type="bibr">6</xref>
]. SEER is a complex pipeline requiring several separate steps for the user to execute and currently has many system-level dependencies for successful compilation and installation. Another similar tool, Kover, handles only discrete phenotypes, counts <italic>k</italic>
-mers of a user-defined size and does not use any correction for population structure [<xref rid="pcbi.1006434.ref009" ref-type="bibr">9</xref>
]. The Neptune software targets so-called 'signatures' differentiating two groups of sequences but cannot locate smaller mutations, such as isolated single nucleotide variations, which is why it was not used in the comparison in the current paper. The 'signatures' that Neptune detects are relatively large genomic loci, which may include genomic islands, phage regions or operons [<xref rid="pcbi.1006434.ref010" ref-type="bibr">10</xref>
].</p>
<p>We created PhenotypeSeeker as we observed the need for a tool that could combine all the benefits of the programs available but at the same time would be easily executable and would take a reasonable amount of computing resources without the need for dedicated high-performance computer hardware.</p>
</sec>
<sec sec-type="results" id="sec002"><title>Results</title>
<sec id="sec003"><title>Implementation</title>
<p>PhenotypeSeeker consists of two subprograms: 'PhenotypeSeeker modeling' and 'PhenotypeSeeker prediction'. 'PhenotypeSeeker modeling' takes either assembled contigs or raw-read data as an input and builds a statistical model for phenotype prediction. The method starts with counting all possible <italic>k</italic>
-mers from the input genomes, using the GenomeTester4 software package [<xref rid="pcbi.1006434.ref011" ref-type="bibr">11</xref>
], followed by <italic>k</italic>
-mer filtering by their frequency in strains. Subsequently, the <italic>k</italic>
-mer selection for regression analysis is performed. In this step, to test the <italic>k</italic>
-mers’ association with the phenotype, the method applies Welch’s two-sample t-test if the phenotype is continuous and a chi-squared test if it is binary. Finally, the logistic regression or linear regression model is built. The PhenotypeSeeker output gives the regression model in a binary format and several text files, which include the following: (1) the results of association tests for identifying the <italic>k</italic>
-mers most strongly associated with the given phenotype, (2) the coefficients of <italic>k</italic>
-mers in the regression model for identifying the <italic>k</italic>
-mers that have the greatest effects on the outcomes of the machine learning model, (3) a FASTA file with phenotype-specific <italic>k</italic>
-mers, assembled into longer contigs when possible, to facilitate annotation by the user, and (4) a summary of the regression analysis performed (<xref ref-type="fig" rid="pcbi.1006434.g001">Fig 1</xref>
). Optionally, it is possible to use weighting for the strains to take into account the clonal population structure. The weights are based on a distance matrix of strains made with an alignment-free <italic>k</italic>
-mer-based method called Mash [<xref rid="pcbi.1006434.ref012" ref-type="bibr">12</xref>
]. The weights of each genome are calculated using the Gerstein, Sonnhammer and Chothia method [<xref rid="pcbi.1006434.ref013" ref-type="bibr">13</xref>
]. 'PhenotypeSeeker prediction' uses the regression model generated by 'PhenotypeSeeker modeling' to conduct fast phenotype predictions on input samples (<xref ref-type="fig" rid="pcbi.1006434.g001">Fig 1</xref>
). Using gmer_counter from the FastGT package [<xref rid="pcbi.1006434.ref014" ref-type="bibr">14</xref>
], the tool searches the samples only for the <italic>k</italic>
-mers used as parameters in the regression model. Predictions are then made based on the presence or absence of these <italic>k</italic>
-mers.</p>
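<p>The two association tests named above can be sketched with the Python standard library alone. This is an illustrative sketch, not the PhenotypeSeeker source code: the function names and toy data are assumptions, only the test statistics are computed (no p-values), strain weighting and any continuity correction are omitted, and the chi-squared helper assumes that no row or column total of the 2 x 2 table is zero.</p>
<preformat>
```python
from math import sqrt
from statistics import mean, variance


def chi_squared_2x2(presence, phenotype):
    """Pearson chi-squared statistic for k-mer presence (0/1) vs. binary phenotype (0/1)."""
    n = len(presence)
    observed = [[0, 0], [0, 0]]
    for has_kmer, label in zip(presence, phenotype):
        observed[has_kmer][label] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


def welch_t(values_with_kmer, values_without_kmer):
    """Welch's t statistic comparing phenotype values of strains with and without a k-mer."""
    n1, n2 = len(values_with_kmer), len(values_without_kmer)
    standard_error = sqrt(
        variance(values_with_kmer) / n1 + variance(values_without_kmer) / n2
    )
    return (mean(values_with_kmer) - mean(values_without_kmer)) / standard_error


# Toy data (invented): one k-mer scored across eight strains.
presence = [1, 1, 1, 1, 0, 0, 0, 0]
resistant = [1, 1, 1, 0, 0, 0, 0, 1]
chi2 = chi_squared_2x2(presence, resistant)

# Toy continuous phenotype values for strains with and without the k-mer.
t_statistic = welch_t([4.0, 8.0], [1.0, 3.0])
print(chi2, t_statistic)
```
</preformat>
<p>In a full analysis, each statistic would be converted to a p-value and the test repeated for every k-mer that passes the frequency filter, so that the most strongly associated k-mers can be passed on to the regression step.</p>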
<fig id="pcbi.1006434.g001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.g001</object-id>
<label>Fig 1</label>
<caption><title>Schematic presentation of PhenotypeSeeker workflow.</title>
<p>Panel A shows the 'PhenotypeSeeker modeling' steps, which generate the phenotype prediction model based on the input genomes and their phenotype values. Panel B shows the 'PhenotypeSeeker prediction' steps, which use the previously generated model to predict the phenotypes for input genomes.</p>
</caption>
<graphic xlink:href="pcbi.1006434.g001"></graphic>
</fig>
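<p>For a binary phenotype, the prediction step in Panel B amounts to evaluating a logistic regression on k-mer presence or absence. The sketch below is an assumption-laden illustration of that idea: the k-mers, coefficients, intercept and the function name predict_probability are invented, and the real tool reads its model from the binary file produced by 'PhenotypeSeeker modeling'.</p>
<preformat>
```python
from math import exp


def predict_probability(sample_kmers, coefficients, intercept):
    """Logistic-regression probability from the presence or absence of model k-mers."""
    linear_score = intercept + sum(
        weight for kmer, weight in coefficients.items() if kmer in sample_kmers
    )
    return 1.0 / (1.0 + exp(-linear_score))


# Hypothetical model: k-mers, weights and intercept are invented for illustration.
coefficients = {"ACGTACGTACGTA": 2.0, "TTGACCATGCAAT": -1.0}
intercept = -0.5

# k-mers found in a hypothetical sample; only those in the model affect the score.
sample_kmers = {"ACGTACGTACGTA", "G" * 13}
probability = predict_probability(sample_kmers, coefficients, intercept)
print(round(probability, 3))
```
</preformat>
<p>Because only the k-mers present in the model need to be looked up in each sample, the cost of a prediction depends on the size of the model rather than on the total number of distinct k-mers in the genome.</p>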
<p>PhenotypeSeeker uses fixed-length k-mers in all analyses. Thus, the <italic>k</italic>
-mer length is an important factor influencing the overall software performance. The effects of <italic>k</italic>
-mer length on speed, memory usage and accuracy were tested on a <italic>P</italic>
. <italic>aeruginosa</italic>
 ciprofloxacin dataset. A general observation from that analysis is that the CPU time and the PhenotypeSeeker memory usage increase when the <italic>k</italic>
-mer length increases (<xref ref-type="fig" rid="pcbi.1006434.g002">Fig 2</xref>
). Previously described mutations in the <italic>P</italic>
. <italic>aeruginosa parC</italic>
 and <italic>gyrA</italic>
 genes were always detected if the <italic>k</italic>
-mer length was at least 13 nucleotides. We assume that in most cases, a <italic>k</italic>
-mer length of 13 is sufficient to detect biologically relevant mutations, although in certain cases, longer <italic>k</italic>
-mers might provide additional sensitivity. The k-mer length in PhenotypeSeeker is a user-selectable parameter. Although most of the phenotype detection can be performed with the default k-mer value, we suggest experimenting with longer k-mers in the model building phase. All subsequent analyses in this article are performed with a <italic>k</italic>
-mer length of 13, unless specified otherwise.</p>
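<p>One way to see why resource usage grows with k-mer length is to look at the size of the k-mer space. The sketch below computes the theoretical upper bound of 4 to the power k distinct k-mers over the DNA alphabet for the lengths used in this article; the function name is an assumption, and the number actually observed in a dataset is further limited by the total sequence length.</p>
<preformat>
```python
def max_distinct_kmers(k):
    """Upper bound on the number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


for k in (13, 16, 18):
    print(k, max_distinct_kmers(k))
```
</preformat>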
<fig id="pcbi.1006434.g002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.g002</object-id>
<label>Fig 2</label>
<caption><title>The influence of <italic>k</italic>
-mer length on the CPU time and total RAM usage of PhenotypeSeeker (bars, left axis) and on the number of different <italic>k</italic>
-mers present in the genomes (line, right axis).</title>
</caption>
<graphic xlink:href="pcbi.1006434.g002"></graphic>
</fig>
</sec>
<sec id="sec004"><title>Ciprofloxacin resistance phenotype in <italic>Pseudomonas aeruginosa</italic>
</title>
<p>PhenotypeSeeker was applied to the dataset composed of <italic>P</italic>
. <italic>aeruginosa</italic>
 genomes and corresponding ciprofloxacin resistance values measured in terms of minimum inhibitory concentration (MIC) (μg/ml), which is defined as the lowest concentration of antibiotic that will inhibit the visible growth of the isolate under investigation after an appropriate period of incubation [<xref rid="pcbi.1006434.ref015" ref-type="bibr">15</xref>
]. We built two separate models using a continuous phenotype for one and binary phenotype for another. Binary phenotype values were created based on EUCAST ciprofloxacin breakpoints [<xref rid="pcbi.1006434.ref016" ref-type="bibr">16</xref>
]. Both models detected <italic>k</italic>
-mers associated with mutations in quinolone resistance determining regions (QRDR) of the <italic>parC</italic>
 (c.260C>T, p.Ser87Leu) and <italic>gyrA</italic>
 (c.248C>T, p.Thr83Ile) genes (<xref ref-type="fig" rid="pcbi.1006434.g003">Fig 3</xref>
, <xref ref-type="supplementary-material" rid="pcbi.1006434.s005">S2 Table</xref>
). These genes encode DNA topoisomerase IV subunit A and DNA gyrase subunit A, the target proteins of ciprofloxacin [<xref rid="pcbi.1006434.ref017" ref-type="bibr">17</xref>
]. Mutations in the QRDR regions of these genes are well-known causes of decreased sensitivity to quinolone antibiotics, such as ciprofloxacin [<xref rid="pcbi.1006434.ref018" ref-type="bibr">18</xref>
]. The classification model built using a binary phenotype had an F1-measure of 0.88, prediction accuracy of 0.88, sensitivity of 0.90 and specificity of 0.87 on the test subset (Table A in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
). The MIC prediction model built using the continuous phenotype had the coefficient of determination (R<sup>2</sup>
) of 0.42, the Pearson correlation coefficient of 0.68 and the Spearman correlation coefficient of 0.84 (Table M in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
).</p>
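<p>The coding-sequence positions and amino acid positions quoted for the QRDR mutations are linked by simple codon arithmetic, which the following sketch makes explicit. The helper name codon_of is an assumption; the positions 260 and 248 are the ones given in the text.</p>
<preformat>
```python
def codon_of(coding_position):
    """Map a 1-based coding-sequence position to (codon number, position within codon)."""
    index = coding_position - 1
    return index // 3 + 1, index % 3 + 1


for gene, position in (("parC", 260), ("gyrA", 248)):
    codon_number, position_in_codon = codon_of(position)
    print(gene, position, codon_number, position_in_codon)
```
</preformat>
<p>Both substitutions fall on the second base of their codon, in codon 87 of parC and codon 83 of gyrA, which agrees with the protein-level changes p.Ser87Leu and p.Thr83Ile.</p>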
<fig id="pcbi.1006434.g003" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.g003</object-id>
<label>Fig 3</label>
<caption><title>The positions of ciprofloxacin-resistant <italic>P</italic>
. <italic>aeruginosa</italic>
 strains on cladogram.</title>
<p>The MIC values (mg/l) are marked to the external nodes with corresponding strain names. Strains with MIC > 0.5 mg/l are highlighted with yellow to denote ciprofloxacin resistance according to EUCAST breakpoints [<xref rid="pcbi.1006434.ref016" ref-type="bibr">16</xref>
]. Strains with detected mutations in QRDR of <italic>gyrA</italic>
 and <italic>parC</italic>
 are marked with the color code on the perimeter of the cladogram.</p>
</caption>
<graphic xlink:href="pcbi.1006434.g003"></graphic>
</fig>
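<p>The conversion of MIC values into a binary phenotype can be sketched as a simple threshold, using the 0.5 mg/l cut-off quoted in the Fig 3 caption. The strain names, MIC values and function name below are invented for illustration, and treating a MIC equal to the cut-off as non-resistant is an assumption that follows the strict inequality in the caption.</p>
<preformat>
```python
RESISTANCE_BREAKPOINT_MG_PER_L = 0.5  # cut-off quoted in the Fig 3 caption


def binarize_mic(mic_mg_per_l):
    """Return 1 (resistant) if the MIC exceeds the breakpoint, otherwise 0."""
    return 1 if mic_mg_per_l > RESISTANCE_BREAKPOINT_MG_PER_L else 0


# Toy MIC values (invented) for three hypothetical strains.
toy_mics = {"strain_A": 0.125, "strain_B": 0.5, "strain_C": 4.0}
labels = {name: binarize_mic(mic) for name, mic in toy_mics.items()}
print(labels)
```
</preformat>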
</sec>
<sec id="sec005"><title>Azithromycin resistance phenotype in <italic>Clostridium difficile</italic></title>
<p>In addition to the <italic>P</italic>
. <italic>aeruginosa</italic>
 dataset, we tested a <italic>C</italic>
. <italic>difficile</italic>
 azithromycin resistance dataset (<xref ref-type="supplementary-material" rid="pcbi.1006434.s005">S2 Table</xref>
) studied using Kover in Drouin et al., 2016 [<xref rid="pcbi.1006434.ref009" ref-type="bibr">9</xref>
]. <italic>ermB</italic>
 and the Tn6110 transposon were the sequences known and predicted by Kover to be important in an azithromycin resistance model [<xref rid="pcbi.1006434.ref009" ref-type="bibr">9</xref>
]. <italic>ermB</italic>
 was not located on the Tn6110 transposon. PhenotypeSeeker found <italic>k</italic>
-mers for both sequences when using <italic>k</italic>
-mers of length 13 or 16. Tn6110 is a transposon that is over 58 kbp long and contains several protein coding sequences, including 23S rRNA methyltransferase, which is associated with macrolide resistance [<xref rid="pcbi.1006434.ref019" ref-type="bibr">19</xref>
]. The predictive models with all tested <italic>k</italic>
-mer lengths (13, 16 and 18) contained <italic>k</italic>
-mers covering the entire Tn6110 transposon sequence, both in protein coding and non-coding regions. In addition to the 23S rRNA methyltransferase gene, <italic>k</italic>
-mers in all three models were mapped to the recombinase family protein, sensor histidine kinase, ABC transporter permease, TlpA family protein disulfide reductase, endonuclease, helicase and conjugal transfer protein coding regions. The model built for the <italic>C</italic>
. <italic>difficile</italic>
 azithromycin resistance phenotype had an F1-measure of 0.97, prediction accuracy of 0.97, sensitivity of 0.96 and specificity of 0.97 on the test subset (Table A in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
).</p>
</sec>
<sec id="sec006"><title>Virulence phenotype in <italic>Klebsiella pneumoniae</italic>
</title>
<p>In addition to antibiotic resistance phenotypes in <italic>P</italic>
. <italic>aeruginosa</italic>
 and <italic>C</italic>
. <italic>difficile</italic>
, we used <italic>K</italic>
. <italic>pneumoniae</italic>
 human infection-causing strains as a different kind of phenotype example. <italic>K</italic>
. <italic>pneumoniae</italic>
 strains contain several genetic loci that are related to virulence. These loci include aerobactin, yersiniabactin, colibactin, salmochelin and microcin siderophore system gene clusters [<xref rid="pcbi.1006434.ref020" ref-type="bibr">20</xref>
–<xref rid="pcbi.1006434.ref024" ref-type="bibr">24</xref>
], the allantoinase gene cluster [<xref rid="pcbi.1006434.ref025" ref-type="bibr">25</xref>
], <italic>rmpA</italic>
 and <italic>rmpA2</italic>
 regulators [<xref rid="pcbi.1006434.ref026" ref-type="bibr">26</xref>
,<xref rid="pcbi.1006434.ref027" ref-type="bibr">27</xref>
], the ferric uptake operon <italic>kfuABC</italic>
 [<xref rid="pcbi.1006434.ref028" ref-type="bibr">28</xref>
] and the two-component regulator <italic>kvgAS</italic>
 [<xref rid="pcbi.1006434.ref029" ref-type="bibr">29</xref>
]. The model predicted by PhenotypeSeeker for invasive/infectious phenotypes included 13-mers representing several of these genes. Genes in colibactin (<italic>clbQ</italic>
 and <italic>clbO</italic>
), aerobactin (<italic>iucB</italic>
 and <italic>iucC</italic>
) and yersiniabactin (<italic>irp1</italic>
, <italic>irp2</italic>
, <italic>fyuA</italic>
, <italic>ybtQ</italic>
, <italic>ybtX</italic>
, and <italic>ybtP</italic>
) clusters showed the most differentiating pattern between carrier and invasive/infectious strains (<xref ref-type="fig" rid="pcbi.1006434.g004">Fig 4</xref>
; <xref ref-type="supplementary-material" rid="pcbi.1006434.s005">S2 Table</xref>
). A 13-mer mapping to a gene coding for the capsule assembly protein Wzi was also represented in the model. The model built for <italic>K</italic>
. <italic>pneumoniae</italic>
 invasive/infectious phenotypes had an F1-measure of 0.88, prediction accuracy of 0.88, sensitivity of 0.91 and specificity of 0.78 on the test subset (Table A in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
).</p>
<fig id="pcbi.1006434.g004" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.g004</object-id>
<label>Fig 4</label>
<caption><title>Virulence genes in corresponding clusters and <italic>wzi</italic>
 included in the PhenotypeSeeker prediction model in <italic>K</italic>
. <italic>pneumoniae</italic>
 strains (13-mers, weighted, max. 10 000 <italic>k</italic>
-mers for the regression model).</title>
<p>Each row is one strain, and each column represents one protein coding gene. Blue cells represent 13-mers in the model for the corresponding gene and a strain. Genes in colibactin, aerobactin and yersiniabactin clusters show the most differentiating pattern between carrier and invasive/infectious strains.</p>
</caption>
<graphic xlink:href="pcbi.1006434.g004"></graphic>
</fig>
</sec>
<sec id="sec007"><title>Classification accuracy and running time</title>
<p>To measure the average classification accuracies of logistic regression models, all three datasets were divided into training and test sets containing approximately 75% and 25% of the strains, respectively. A <italic>k</italic>
-mer length of 13 was used, and a weighted approach was tested on binary phenotypes (<xref rid="pcbi.1006434.t001" ref-type="table">Table 1</xref>
). To reduce the influence of sequencing errors when using sequencing reads instead of assembled contigs as the input, we only counted 13-mers as being present in one of the input lists if they occurred at least 5 times in that input list. The PhenotypeSeeker prediction accuracy is not lower when using raw sequencing reads instead of assembled genomes, and therefore, assembly building is not required before model building. Our results with <italic>K</italic>
. <italic>pneumoniae</italic>
 show that PhenotypeSeeker can be successfully applied to other kinds of phenotypes in addition to antibiotic resistance.</p>
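<p>The performance measures reported for the binary models can be derived from a confusion matrix, as in the sketch below. It uses the standard binary definitions with invented counts; the function name is an assumption, and the averaging scheme behind the reported F1-measures is not specified in this section, so the sketch may not reproduce them exactly.</p>
<preformat>
```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall for the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Toy counts (invented) for a test set of 20 isolates.
metrics = classification_metrics(tp=9, fp=2, tn=8, fn=1)
for name, value in metrics.items():
    print(name, round(value, 3))
```
</preformat>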
<table-wrap id="pcbi.1006434.t001" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.t001</object-id>
<label>Table 1</label>
<caption><title>Model’s F1-measure and running time.</title>
<p>The results with 13-mers and weighting are shown. The maximum number of 13-mers selected for the regression model was 1000. In cases where sequencing reads were used as the input, a minimum frequency of 5 for a 13-mer was required to reduce the influence of sequencing errors.</p>
</caption>
<alternatives><graphic id="pcbi.1006434.t001g" xlink:href="pcbi.1006434.t001"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead><tr><th align="center" rowspan="2" colspan="1">Dataset<break></break>
</th>
<th align="center" rowspan="2" colspan="1">F1-measure</th>
<th align="center" colspan="2" rowspan="1">Number of isolates</th>
<th align="center" rowspan="2" colspan="1">Time for the model building (per model)</th>
<th align="center" rowspan="2" colspan="1">Time for the phenotype prediction (per phenotype)<break></break>
</th>
</tr>
<tr><th align="center" rowspan="1" colspan="1">Training</th>
<th align="center" rowspan="1" colspan="1">Testing</th>
</tr>
</thead>
<tbody><tr><td align="center" rowspan="1" colspan="1"><italic>Pseudomonas aeruginosa</italic>
 (contigs)</td>
<td align="center" rowspan="1" colspan="1">0.88</td>
<td align="center" rowspan="1" colspan="1">150</td>
<td align="center" rowspan="1" colspan="1">50</td>
<td align="center" rowspan="1" colspan="1">3h 36m</td>
<td align="center" rowspan="1" colspan="1">0.81s</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1"><italic>Pseudomonas aeruginosa</italic>
 (reads)</td>
<td align="center" rowspan="1" colspan="1">0.88</td>
<td align="center" rowspan="1" colspan="1">150</td>
<td align="center" rowspan="1" colspan="1">50</td>
<td align="center" rowspan="1" colspan="1">19h 56m</td>
<td align="center" rowspan="1" colspan="1">58.0s</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1"><italic>Klebsiella pneumoniae</italic>
 (contigs)</td>
<td align="center" rowspan="1" colspan="1">0.88</td>
<td align="center" rowspan="1" colspan="1">125</td>
<td align="center" rowspan="1" colspan="1">42</td>
<td align="center" rowspan="1" colspan="1">3h 38m</td>
<td align="center" rowspan="1" colspan="1">0.7s</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1"><italic>Klebsiella pneumoniae</italic>
 (reads)</td>
<td align="center" rowspan="1" colspan="1">0.88</td>
<td align="center" rowspan="1" colspan="1">125</td>
<td align="center" rowspan="1" colspan="1">42</td>
<td align="center" rowspan="1" colspan="1">10h 3m</td>
<td align="center" rowspan="1" colspan="1">28.0s</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1"><italic>Clostridium difficile</italic>
 (contigs)</td>
<td align="center" rowspan="1" colspan="1">0.97</td>
<td align="center" rowspan="1" colspan="1">345</td>
<td align="center" rowspan="1" colspan="1">115</td>
<td align="center" rowspan="1" colspan="1">4h 50m</td>
<td align="center" rowspan="1" colspan="1">0.61s</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>In our trials, the model building on a given dataset took 3 to 5 hours per phenotype, and prediction of the phenotype took less than a second on assembled genomes (<xref rid="pcbi.1006434.t001" ref-type="table">Table 1</xref>
). The CPU time of model building by PhenotypeSeeker depends mainly on the number of different <italic>k</italic>
-mers in genomes of the training set. The analysis performed on our 200 <italic>P</italic>
. <italic>aeruginosa</italic>
 genomes showed that the CPU time of the model building grows linearly with the number of genomes given as input (<xref ref-type="supplementary-material" rid="pcbi.1006434.s001">S1 Fig</xref>
).</p>
<p>The memory requirement of PhenotypeSeeker did not exceed 2 GB with default parameter settings, which allows analyses to be run on laptop computers (<xref ref-type="supplementary-material" rid="pcbi.1006434.s002">S2 Fig</xref>
) if necessary. The p-value cut-offs during the <italic>k</italic>
-mer filtering step influence the number of <italic>k</italic>
-mers included in the model and have a potentially strong impact on model performance. Tables A-E in the <xref ref-type="supplementary-material" rid="pcbi.1006434.s004">S1 Table</xref>
 show the effects of different p-value cut-offs on model performances.</p>
</sec>
<sec id="sec008"><title>Comparison with other software</title>
<p>We ran SEER and Kover on the same <italic>P</italic>
. <italic>aeruginosa</italic>
 ciprofloxacin dataset and <italic>C</italic>
. <italic>difficile</italic>
 azithromycin resistance dataset to compare the efficiency and CPU time usage with PhenotypeSeeker.</p>
<p>In the <italic>P</italic>
. <italic>aeruginosa</italic>
 dataset, SEER was able to detect <italic>gyrA</italic>
 and <italic>parC</italic>
 mutations only when resistance was defined as a binary phenotype. In cases with a continuous phenotype, those <italic>k</italic>
-mers did not pass the p-value filtering step. Since Kover's aim is to create a resistance predicting model, not an exhaustive list of significant <italic>k</italic>
-mers, it was expected that not all the mutations would be described in the output. <italic>gyrA</italic>
 variation already sufficiently characterized the resistant strains set, and therefore, <italic>parC</italic>
 mutations were not included in the model. The same applies to the PhenotypeSeeker results with 16- and 18-mers. <italic>parC</italic>
-specific 16- or 18-mers were included among the 1000 <italic>k</italic>
-mers in the prediction model (based on statistically significant p-values) but with the regression coefficient equal to zero because they were present in the same strains as <italic>gyrA</italic>
 specific predictive <italic>k</italic>
-mers.</p>
<p>In the <italic>C</italic>
. <italic>difficile</italic>
 dataset, our model included the known resistance gene <italic>ermB</italic>
 and transposon Tn6110. We were able to find <italic>ermB</italic>
 with both SEER and Kover. We also detected Tn6110-specific <italic>k</italic>
-mers with SEER and with Kover when it was run with 16-mers instead of the default 31-mers.</p>
<p>Regarding the CPU time, PhenotypeSeeker with 13-mers was faster than other tested software programs (3.5 hrs vs 14–15 hrs) without losing the relevant markers in the output (<xref rid="pcbi.1006434.t002" ref-type="table">Table 2</xref>
). With 16- or 18-mers, PhenotypeSeeker’s running time increases but remains lower than that of SEER and Kover.</p>
<table-wrap id="pcbi.1006434.t002" orientation="portrait" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.1006434.t002</object-id>
<label>Table 2</label>
<caption><title>PhenotypeSeeker comparison to Kover and SEER using <italic>P</italic>
. <italic>aeruginosa</italic> and <italic>C</italic>
. <italic>difficile</italic>
 data.</title>
<p>PhenotypeSeeker with the weighting option and maximum 1000 k-mers for the regression model was used.</p>
</caption>
<alternatives><graphic id="pcbi.1006434.t002g" xlink:href="pcbi.1006434.t002"></graphic>
<table frame="hsides" rules="groups"><colgroup span="1"><col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead><tr><th align="center" colspan="2" rowspan="2"></th>
<th align="center" colspan="3" rowspan="1"><italic>Pseudomonas aeruginosa</italic>
 (200 genomes)</th>
<th align="center" colspan="3" rowspan="1"><italic>Clostridium difficile</italic>
 (459 genomes)</th>
</tr>
<tr><th align="center" colspan="2" rowspan="1">Previously known CIP resistance mutations detected</th>
<th align="center" rowspan="1" colspan="1"></th>
<th align="center" colspan="2" rowspan="1">Previously known AZM resistance genes<xref ref-type="table-fn" rid="t002fn001">*</xref>
 detected</th>
<th align="center" rowspan="1" colspan="1"></th>
</tr>
<tr><th align="center" rowspan="1" colspan="1">Software</th>
<th align="center" rowspan="1" colspan="1"><italic>k-</italic>
mer length</th>
<th align="center" rowspan="1" colspan="1"><italic>gyrA</italic>
<break></break>
c.248C>T</th>
<th align="center" rowspan="1" colspan="1"><italic>parC</italic>
 c.260C>T</th>
<th align="center" rowspan="1" colspan="1">Time for model building</th>
<th align="center" rowspan="1" colspan="1"><italic>ermB</italic>
</th>
<th align="center" rowspan="1" colspan="1">Tn6110 transposon</th>
<th align="center" rowspan="1" colspan="1">Time for model building</th>
</tr>
</thead>
<tbody><tr><td align="center" rowspan="1" colspan="1">Phenotype<break></break>
Seeker</td>
<td align="center" rowspan="1" colspan="1">13</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">3h 36m</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">4h 47m</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1">Phenotype<break></break>
Seeker</td>
<td align="center" rowspan="1" colspan="1">16</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">6h 51m</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">9h 7m</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1">Phenotype<break></break>
Seeker</td>
<td align="center" rowspan="1" colspan="1">18</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">7h 31m</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">9h 58m</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1">Kover</td>
<td align="center" rowspan="1" colspan="1">16</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">14h 14m</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">14h 10m</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1">Kover</td>
<td align="center" rowspan="1" colspan="1">31</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">14h 46m</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">-</td>
<td align="center" rowspan="1" colspan="1">13h 40m</td>
</tr>
<tr><td align="center" rowspan="1" colspan="1">SEER</td>
<td align="center" rowspan="1" colspan="1">9–100</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">15h 7m</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">+</td>
<td align="center" rowspan="1" colspan="1">15h 32m</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot><fn id="t002fn001"><p>* As reported in Drouin <italic>et al</italic>
. 2016 [<xref rid="pcbi.1006434.ref009" ref-type="bibr">9</xref>
]</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions" id="sec009"><title>Discussion</title>
<p>PhenotypeSeeker works as an easy-to-use application to list the candidate biomarkers behind a studied bacterial phenotype and to create a predictive model. Based on <italic>k</italic>
-mers, PhenotypeSeeker does not require a reference genome and is therefore also usable for species with very high intraspecific variation where the selection of one genome as a reference can be complicated.</p>
<p>PhenotypeSeeker supports both discrete and continuous phenotypes as inputs. In addition, this model takes into account the population structure to highlight only the possible causal variations and not the mutations arising from the clonal nature of bacterial populations.</p>
<p>Unlike Kover, PhenotypeSeeker outputs not only a trained model for predicting resistance in a separate set of isolates but also the complete list of statistically significant candidate variations that separate antibiotic-resistant and susceptible isolates, for further biological interpretation. Compared with SEER, PhenotypeSeeker is easier to install and can be run with a single command for building a model and another single command for using it for prediction.</p>
<p>Our tests using PhenotypeSeeker to detect antibiotic resistance markers in <italic>P</italic>
. <italic>aeruginosa</italic>
 and <italic>C</italic>
. <italic>difficile</italic>
 showed that it is capable of detecting all previously known mutations in a reasonable amount of time and with a relatively short <italic>k</italic>
-mer length. Users can choose the <italic>k</italic>
-mer length as well as decide whether to use the population structure correction step. Due to the clonal nature of bacterial populations, this step is highly advised for detecting genuine causal variations instead of strain-level differences. In addition to a trained predictive model, the list of <italic>k</italic>
-mers covering possible variations related to the phenotype is produced for further interpretation by the user. The effectiveness of the model can vary because of the nature of different phenotypes in different bacterial species. Simple forms of antibiotic resistance that are unambiguously determined by one or two specific mutations or the insertion of a gene are likely to be successfully detected by our method, and effective predictive models for subsequent phenotype predictions can be created. This is supported by our prediction accuracy over 96% in the <italic>C</italic>
. <italic>difficile</italic>
 dataset. On the other hand, <italic>P</italic>
. <italic>aeruginosa</italic>
 antibiotic resistance is one of the most complicated phenotypes among clinically relevant pathogens since it is not often easily described by certain single nucleotide mutations in one gene but rather through a complex system involving several genes and their regulators leading to multi-resistant strains. In cases such as this, the prediction is less accurate (88% in our dataset), but nevertheless, a complete list of <italic>k</italic>
-mers covering differentiating markers between resistant and sensitive strains can provide more insight into the actual resistance mechanisms and provide candidates for further experimental testing.</p>
<p>Tests with <italic>K</italic>
. <italic>pneumoniae</italic>
 virulence phenotypes showed that PhenotypeSeeker is not limited to antibiotic resistance phenotypes but is potentially applicable to other measurable phenotypes as well and is therefore usable in a wider range of studies.</p>
<p>Since PhenotypeSeeker input is not restricted to assembled genomes, one can skip the assembly step and calculate models based on raw read data. In this case, it should be taken into account that sequencing errors may randomly generate phenotype-specific k-mers; thus, we suggest using the built-in option to remove low frequency <italic>k</italic>
-mers. The <italic>k</italic>
-mer frequency cut-off threshold depends on the sequencing coverage of the genomes and is therefore implemented as user-selectable. One can also build the model based on high-quality assembled genomes and then use the model for corresponding phenotype prediction on raw sequencing data.</p>
</sec>
<sec sec-type="materials|methods" id="sec010"><title>Methods</title>
<sec id="sec011"><title>Data</title>
<p>PhenotypeSeeker was tested on the following three bacterial species: <italic>Pseudomonas aeruginosa</italic>
, <italic>Clostridium difficile</italic>
 and <italic>Klebsiella pneumoniae</italic>
. The <italic>P</italic>
. <italic>aeruginosa</italic>
 dataset was composed of 200 assembled genomes and the minimal inhibitory concentration measurements (MICs) for ciprofloxacin. The <italic>P</italic>
. <italic>aeruginosa</italic>
 strains were isolated during the project Transfer routes of antibiotic resistance (ABRESIST) performed as part of the Estonian Health Promotion Research Programme (TerVE) implemented by the Estonian Research Council, the Ministry of Agriculture (now the Ministry of Rural Affairs), and the National Institute for Health Development. Isolated strains originated from humans, animals and the environment within the same geographical location in Estonia and belonged to 103 different MLST sequence types (Laht et al., <italic>Pseudomonas aeruginosa</italic>
 distribution among humans, animals and the environment (submitted); Telling et al., Multidrug resistant <italic>Pseudomonas aeruginosa</italic>
 in Estonian hospitals (submitted)). Full genomes were sequenced by Illumina HiSeq2500 (Illumina, San Diego, USA) with paired-end, 150 bp reads (Nextera XT libraries) and de novo assembled with the program SPAdes (ver 3.5.0) [<xref rid="pcbi.1006434.ref030" ref-type="bibr">30</xref>
]. MICs were determined by using the epsilometer test (E-test, bioMérieux, Marcy l'Etoile, France) according to the manufacturer's instructions. Binary phenotypes were obtained by converting the MIC values into 0 (sensitive) and 1 (resistant) phenotypes according to the European Committee on Antimicrobial Susceptibility Testing (EUCAST) breakpoints [<xref rid="pcbi.1006434.ref016" ref-type="bibr">16</xref>
]. The resulting dataset consisted of 124 ciprofloxacin sensitive <italic>P</italic>
. <italic>aeruginosa</italic>
 isolates (62%) and 76 ciprofloxacin resistant <italic>P</italic>
. <italic>aeruginosa</italic>
 isolates (38%) and is deposited in the NCBI’s BioProject database under the accession number PRJNA244279 (<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279">https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA244279</ext-link>
).</p>
<p>The <italic>C</italic>
. <italic>difficile</italic>
 dataset was composed of assembled genomes of 459 isolates and the binary phenotypes of azithromycin resistance (sensitive = 0 vs resistant = 1), adapted from Drouin et al., 2016 [<xref rid="pcbi.1006434.ref009" ref-type="bibr">9</xref>
]. The isolates originated from patients from different hospitals in the province of Quebec, Canada, and the genomes were obtained from the European Nucleotide Archive [EMBL:PRJEB11776 (<ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/data/view/PRJEB11776">http://www.ebi.ac.uk/ena/data/view/PRJEB11776</ext-link>
)]. The dataset consisted of 246 azithromycin sensitive isolates (54%) and 213 azithromycin resistant isolates (46%).</p>
<p>The <italic>K</italic>
. <italic>pneumoniae</italic>
 dataset was composed of reads of 167 isolates, originating from six countries and sampled to maximize diversity, and the binary clinical phenotype of human carriage status vs human infection (including invasive infections) status (carriage = 0 vs infectious = 1), adapted from Holt et al., 2015 [<xref rid="pcbi.1006434.ref031" ref-type="bibr">31</xref>
]. The reads were received from the European Nucleotide Archive [EMBL:PRJEB2111 (<ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/ena/data/view/PRJEB2111">https://www.ebi.ac.uk/ena/data/view/PRJEB2111</ext-link>
)] and de novo assembled with SPAdes (ver 3.10.1) [<xref rid="pcbi.1006434.ref030" ref-type="bibr">30</xref>
]. The dataset consisted of 36 isolates with human carriage status as phenotype (22%) and 131 <italic>K</italic>
. <italic>pneumoniae</italic>
 isolates with human infection status as phenotype (78%).</p>
<p>Abstractly, each test dataset was composed of pairs (x, y), where x is the bacterial genome x∈{A,T,G,C}*, and y denotes phenotype values specific to a given dataset y ∈ {0.008, …, 1024} (continuous phenotype) or y ∈ {0, 1} (binary phenotype).</p>
</sec>
<sec id="sec012"><title>Compilation of <italic>k</italic>
-mer lists</title>
<p>All operations with <italic>k</italic>
-mers are performed using the GenomeTester4 software package containing the glistmaker, glistquery and glistcompare programs [<xref rid="pcbi.1006434.ref011" ref-type="bibr">11</xref>
]. At first, all <italic>k</italic>
-mers from all samples are counted with glistmaker, which takes either FASTA or FASTQ files as an input and enables us to set the <italic>k</italic>
-mer length up to 32 nucleotides. Subsequently, the <italic>k</italic>
-mers are filtered based on their frequency in strains of the training set. By default, the <italic>k</italic>
-mers that are present in or missing from less than two samples are filtered out and not used in building the model. The remaining <italic>k</italic>
-mers are used in statistical testing for detection of association with the phenotype.</p>
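<p>The counting and frequency-filtering steps described above can be illustrated with a minimal pure-Python sketch. It is only a stand-in for the GenomeTester4 programs, not their implementation: the function names, the toy sequences and the decision to count forward-strand k-mers only are assumptions made for illustration.</p>
<preformat>
```python
from collections import Counter


def count_kmers(sequences, k=13):
    """Count every k-mer in a list of DNA strings (forward strand only)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def present_kmers(sequences, k=13, min_count=1):
    """Return the set of k-mers seen at least min_count times in one sample."""
    counts = count_kmers(sequences, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}


def filter_by_sample_frequency(sample_kmers, min_present=2, min_missing=2):
    """Keep k-mers present in at least min_present samples and
    missing from at least min_missing samples."""
    n_samples = len(sample_kmers)
    presence = Counter()
    for kmers in sample_kmers.values():
        presence.update(kmers)
    return {
        kmer for kmer, n_present in presence.items()
        if n_present >= min_present and n_samples - n_present >= min_missing
    }


# Toy example with hypothetical sequences and k = 3.
samples = {
    "s1": ["ACGTA"],
    "s2": ["ACGTT"],
    "s3": ["ACGGA"],
    "s4": ["TCGGA"],
}
sample_kmers = {name: present_kmers(seqs, k=3) for name, seqs in samples.items()}
kept = filter_by_sample_frequency(sample_kmers)
print(sorted(kept))  # ['CGG', 'CGT', 'GGA']
```
</preformat>
<p>In this toy input, ACG is dropped because it is missing from only one sample, and GTA, GTT and TCG are dropped because each is present in only one sample. For raw reads, a higher min_count (for example 5) plays the role of the minimum frequency cut-off.</p>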
</sec>
<sec id="sec013"><title>Weighting</title>
<p>By default, PhenotypeSeeker conducts the clonal population structure correction step by using a sequence weighting approach that reduces the weight of isolates with closely related genomes. For weighting, pairwise distances between genomes of the training set are calculated using the alignment-free software Mash with default parameters (<italic>k</italic>
-mer size of 21 nucleotides and sketch size of 1000 min-hashes) [<xref rid="pcbi.1006434.ref012" ref-type="bibr">12</xref>
]. Distances estimated by Mash are subsequently used to calculate weights for each genome according to the algorithm proposed by Gerstein, Sonnhammer and Chothia [<xref rid="pcbi.1006434.ref013" ref-type="bibr">13</xref>
]. The calculation of GSC weights is conducted using the PyCogent python package [<xref rid="pcbi.1006434.ref032" ref-type="bibr">32</xref>
]. The GSC weights are taken into account while calculating Welch two-sample t-tests or chi-squared tests to test the <italic>k</italic>
-mers’ associations with the phenotype. Additionally, the GSC weights can be used in the final logistic regression or linear regression (if Ridge regularization is used) model generation.</p>
</sec>
<sec id="sec014"><title>Chi-squared test</title>
<p>In the case of binary phenotype input, the chi-squared test is applied to every <italic>k</italic>
-mer that passes the frequency filtration to determine the <italic>k</italic>
-mer association with phenotype. The null hypothesis assumes that there is no association between <italic>k</italic>
-mer presence and phenotype. The alternative hypothesis assumes that the <italic>k</italic>
-mer is associated with phenotype. The chi-squared test is conducted on the observed and expected counts of strains in each combination of <italic>k</italic>-mer presence and phenotype class, with degrees of freedom = 1, using the scipy.stats Python package [<xref rid="pcbi.1006434.ref033" ref-type="bibr">33</xref>
]. If the user chooses to use the population structure correction step, then weighted chi-squared tests are conducted according to the previously published method [<xref rid="pcbi.1006434.ref034" ref-type="bibr">34</xref>
].</p>
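<p>As an illustration of the unweighted test, the following sketch computes the Pearson chi-squared statistic and its p-value for a 2x2 table of k-mer presence against a binary phenotype. It uses only the Python standard library rather than scipy.stats, and the function name and the example counts are hypothetical. The weighted variant is not reproduced here.</p>
<preformat>
```python
import math


def chi_squared_2x2(present_resistant, present_sensitive,
                    absent_resistant, absent_sensitive):
    """Pearson chi-squared test (df = 1) on a 2x2 table.

    Returns (statistic, p_value). Every row and column total must be
    non-zero, which holds after the frequency filtering step as long as
    both phenotype classes occur in the data.
    """
    observed = [[present_resistant, present_sensitive],
                [absent_resistant, absent_sensitive]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: a k-mer found in 30 of 40 resistant strains
# and in 10 of 40 sensitive strains.
stat, p = chi_squared_2x2(30, 10, 10, 30)
print(stat, p)
```
</preformat>
<p>For these counts every expected value is 20, so the statistic is 20. A k-mer distributed evenly across both classes gives a statistic of 0 and a p-value of 1.</p>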
</sec>
<sec id="sec015"><title>Welch two-sample t-test</title>
<p>In the case of continuous phenotype input, the Welch two-sample t-test is applied to every <italic>k</italic>
-mer that passes the frequency filtration to determine if the mean phenotype values of strains having the <italic>k</italic>
-mer are different from the mean phenotype values of strains that do not have the <italic>k</italic>
-mer. The null hypothesis assumes that the strains with a <italic>k</italic>
-mer have the same mean phenotype value as the strains without the <italic>k</italic>
-mer. The alternative hypothesis assumes that the means of the strains with and without the <italic>k</italic>
-mer are different. The t-test is conducted with these values using the scipy.stats Python package [<xref rid="pcbi.1006434.ref033" ref-type="bibr">33</xref>
], assuming that the samples are independent and have different variance. If the user selects the population structure correction step, then the weighted t-tests are conducted [<xref rid="pcbi.1006434.ref034" ref-type="bibr">34</xref>
]. In that case, the p-value is calculated with the function scipy.stats.t.sf, which takes the absolute value of the t-statistic and the value of degrees of freedom as the input.</p>
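<p>The unweighted Welch statistic and its degrees of freedom can be sketched with the Python standard library as follows. The function name and the example values are hypothetical, and the weighted form is not shown. The p-value step is left out because it needs the t-distribution survival function, which the text above obtains from scipy.stats.t.sf.</p>
<preformat>
```python
import math
from statistics import mean, variance


def welch_t(with_kmer, without_kmer):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Each argument is a list of phenotype values and needs at least two
    entries so that the sample variance is defined.
    """
    n1, n2 = len(with_kmer), len(without_kmer)
    # Sample variances (n - 1 denominator) divided by the group sizes.
    v1 = variance(with_kmer) / n1
    v2 = variance(without_kmer) / n2
    t_stat = (mean(with_kmer) - mean(without_kmer)) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# Hypothetical phenotype values for strains with and without a k-mer.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```
</preformat>
<p>Because the two groups are allowed to have unequal variances, the degrees of freedom are generally not an integer and are smaller than the pooled value n1 + n2 - 2.</p>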
</sec>
<sec id="sec016"><title>Regression analysis</title>
<p>To perform the regression analysis, first, the matrix of samples times features is created. The samples in this matrix are strains given as the input and the features represent the <italic>k</italic>
-mers that are selected for the regression analysis. The values (1 or 0) in this matrix represent the presence or absence, respectively, of a specific <italic>k</italic>
-mer in the specific strain. The target variable of this regression analysis is the phenotype value of each strain (for example, the resistance value). The input data are then divided into training and test sets whose sizes are by default 75% and 25% of the strains, respectively. The proportions of class labels in the training and test sets are kept the same as in the original undivided dataset. In the case of a continuous phenotype, a linear regression model is built, and in the case of a binary phenotype, a logistic regression model is built. Logistic regression was selected for the binary classification task because it performed better on our datasets than the other tested machine learning classifiers, namely support vector machines (with no kernel and with a Gaussian kernel) and random forests. The performance of the logistic regression models on our datasets in comparison with the other machine learning classifiers is shown in <xref ref-type="supplementary-material" rid="pcbi.1006434.s003">S3 Fig</xref>
 and in Tables A-L in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
. The performance of linear regression model on <italic>P</italic>
. <italic>aeruginosa</italic>
 dataset is shown in Table M in <xref ref-type="supplementary-material" rid="pcbi.1006434.s006">S3 Table</xref>
. For both the linear and logistic regression, the Lasso, Ridge or Elastic Net regularization can be selected. The Lasso and Elastic Net regularizations shrink the coefficients of non-relevant features to zero, which simplifies the identification of k-mers that have the strongest association with the phenotype. To enable the evaluation of the output regression model, PhenotypeSeeker provides model-evaluation metrics. For the logistic regression model quality, PhenotypeSeeker provides the mean accuracy as the percentage of correctly classified instances across both classes (0 and 1). Additionally, PhenotypeSeeker provides F1-score, precision, recall, sensitivity, specificity, AUC-ROC, average precision (area under the precision-recall curve), Matthews correlation coefficient (MCC), Cohen’s kappa, very major error rate and major error rate as metrics to assess model performance. For the linear regression model, PhenotypeSeeker provides the mean squared error, the coefficient of determination (R<sup>2</sup>
), the Pearson and the Spearman correlation coefficients and the within ±1 two-fold dilution factor accuracy (useful for evaluating the MIC predictions) as metrics to assess model performance. To select the best regularization parameter alpha, a k-fold cross-validation on the training data is performed. By default, 25 alpha values spaced evenly on a log scale from 1E-6 to 1E6 are tested with 10-fold cross-validation, and the model with the best mean accuracy (logistic regression) or with the best coefficient of determination (linear regression) is saved to the output file. Regression analysis is conducted using the sklearn.linear_model Python package [<xref rid="pcbi.1006434.ref035" ref-type="bibr">35</xref>
].</p>
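<p>Two of the default settings described above, the grid of 25 log-spaced alpha values and the class-preserving 75%/25% split, can be sketched in plain Python. This is an illustration under stated assumptions and not the PhenotypeSeeker code: the function names and the random seed are hypothetical, and the label counts are taken from the P. aeruginosa dataset (124 sensitive and 76 resistant isolates).</p>
<preformat>
```python
import math
import random
from collections import defaultdict


def log_spaced_alphas(alpha_min=1e-6, alpha_max=1e6, n_alphas=25):
    """Return n_alphas values spaced evenly on a log10 scale, endpoints included."""
    low, high = math.log10(alpha_min), math.log10(alpha_max)
    step = (high - low) / (n_alphas - 1)
    return [10 ** (low + i * step) for i in range(n_alphas)]


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into (train, test) with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


alphas = log_spaced_alphas()
labels = [0] * 124 + [1] * 76  # 0 = sensitive, 1 = resistant
train, test = stratified_split(labels)
print(len(alphas), len(train), len(test))  # 25 150 50
```
</preformat>
<p>With these defaults, consecutive alpha values differ by a factor of 10 to the power 0.5, and the split reproduces the 150 training and 50 test isolates listed for P. aeruginosa in Table 1. In practice, library routines such as numpy.logspace and sklearn.model_selection.train_test_split with its stratify argument offer the same functionality; whether PhenotypeSeeker calls them is not stated in the text.</p>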
</sec>
<sec id="sec017"><title>Parameters used for training and testing</title>
<p>Our models were created using mainly k-mer length 13 (“-l 13”; default). We counted the k-mers that occurred at least once per sample (“-c 1”; default) when the analysis was performed on contigs or at least five times per sample (“-c 5”) when the analysis was performed on raw reads. In the first filtering step, we filtered out the k-mers that were present in or missing from fewer than two samples (“--min 2 --max 2”; default) when the analysis was performed on a binary phenotype or fewer than ten samples (“--min 10 --max N-10”; N is the total number of samples) when the analysis was performed on a continuous phenotype. In the next filtering step, we filtered out the k-mers with a statistical test p-value larger than 0.05 (“--p_value 0.05”; default).</p>
<p>The regression analysis was performed with a maximum of 1000 lowest p-valued k-mers (“--n_kmers 1000”; default) when the analysis was done with binary phenotype and with a maximum of 10,000 lowest p-valued k-mers (“--n_kmers 10000”; default) when the analysis was performed with a continuous phenotype. For regression analyses, we split our datasets into training (75%) and test (25%) sets (“-s 0.25”; default). The regression analyses were conducted using Lasso regularization (“-r L1”; default), and the best regularization parameter was picked from the 25 regularization parameters spaced evenly on a log scale from 1E-6 to 1E6 (“--n_alphas 25 --alpha_min 1E-6 --alpha_max 1E6”; default). The model performances with each regularization parameter were evaluated by cross-validation with 10 folds (“--n_splits 10”; default).</p>
<p>The correction for clonal population structure (“--weights +”; default) and assembly of k-mers used in the regression model (“--assembly +”; default) were conducted in all our analyses.</p>
</sec>
<sec id="sec018"><title>Comparison to existing software</title>
<p>SEER was installed and run on a local server with 32 CPU cores and 512 GB RAM, except for the final step, which we were unable to complete without a segmentation fault. This last SEER step was instead run in VirtualBox using the virtual machine image <ext-link ext-link-type="ftp" xlink:href="ftp://ftp.sanger.ac.uk/pub/pathogens/pathogens-vm/pathogens-vm.latest.ova">ftp://ftp.sanger.ac.uk/pub/pathogens/pathogens-vm/pathogens-vm.latest.ova</ext-link>
. Both binary and continuous phenotypes were tested for <italic>P</italic>
. <italic>aeruginosa</italic>
 and only the binary phenotype was tested for <italic>C</italic>
. <italic>difficile</italic>
. Default settings were used. Kover was installed on a local server and used with the settings suggested by the authors in the program tutorial.</p>
</sec>
</sec>
<sec sec-type="supplementary-material" id="sec019"><title>Supporting information</title>
<supplementary-material content-type="local-data" id="pcbi.1006434.s001"><label>S1 Fig</label>
<caption><title>Relationship between the number of input genomes and the CPU time.</title>
<p>The PhenotypeSeeker CPU time depends mainly on the number of different k-mers in the input genomes and on the computations made with every genome. The analysis performed on our 200 P. aeruginosa genomes showed that the PhenotypeSeeker CPU time has a good linear relationship (R<sup>2</sup> = 0.997) with the number of genomes given as input. Although the number of different k-mers grows only logarithmically with the number of input genomes, the relationship is linear because some of the computations made with every genome become more time-consuming as the number of different k-mers in the input genomes increases.</p>
<p>(TIFF)</p>
</caption>
<media xlink:href="pcbi.1006434.s001.tiff"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1006434.s002"><label>S2 Fig</label>
<caption><title>Relationship between the number of input genomes and RAM usage.</title>
<p>The maximum resident set size of PhenotypeSeeker increases in steps with the number of genomes that are given as the input for model training. This is because the maximum resident set size of PhenotypeSeeker is determined by the size of the Python dictionary object in which all different k-mers and their frequencies in genomes are stored. The Python dictionary uses a hash table implementation, and the size of the hash table doubles when it is two thirds full. Therefore, when more genomes are analyzed, more different k-mers are stored in the hash table, and if a certain threshold is exceeded, the next step in the maximum resident set size is taken. However, if the regression is performed with a large number of k-mers, the regression could easily become the most memory-intensive part of the analysis, as the data matrix (k-mers x samples) that is read into memory grows larger (analysis with 150, 170, 180, 190 and 200 genomes).</p>
<p>(TIFF)</p>
</caption>
<media xlink:href="pcbi.1006434.s002.tiff"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1006434.s003"><label>S3 Fig</label>
<caption><title>The confusion matrices of different classification models on the datasets.</title>
<p>(A) The confusion matrices of classification models on contigs (N <italic>k</italic>
-mers = 1,000). (B) The confusion matrices of classification models on reads (N <italic>k</italic>
-mers = 1,000). (C) The confusion matrices of classification models on contigs (N <italic>k</italic>
-mers = 10,000). (D) The confusion matrices of classification models on reads (N <italic>k</italic>
-mers = 10,000). (E) The confusion matrices of classification models on contigs (N <italic>k</italic>
-mers = 100,000). (F) The confusion matrices of classification models on reads (N <italic>k</italic>
-mers = 100,000).</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pcbi.1006434.s003.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1006434.s004"><label>S1 Table</label>
<caption><title>The effects of different p-value cut-offs on model performances.</title>
<p>(PDF)</p>
</caption>
<media xlink:href="pcbi.1006434.s004.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1006434.s005"><label>S2 Table</label>
<caption><title>Phylogenetic trees and isolate-specific information of the studied <italic>P. aeruginosa, C. difficile</italic>
 and <italic>K. pneumoniae</italic>
 isolates.</title>
<p>(XLSX)</p>
</caption>
<media xlink:href="pcbi.1006434.s005.xlsx"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi.1006434.s006"><label>S3 Table</label>
<caption><title>The performance of different machine learning models on the datasets.</title>
<p>(PDF)</p>
</caption>
<media xlink:href="pcbi.1006434.s006.pdf"><caption><p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><ack><p>The authors are grateful to Triinu Kõressaar for her invaluable suggestions toward improvement of the manuscript.</p>
</ack>
<ref-list><title>References</title>
<ref id="pcbi.1006434.ref001"><label>1</label>
<mixed-citation publication-type="journal"><name><surname>Kisand</surname>
<given-names>V</given-names>
</name>
, <name><surname>Lettieri</surname>
<given-names>T</given-names>
</name>
. <article-title>Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools</article-title>
. <source>BMC Genomics</source>
 [Internet]. <year>2013</year>
;<volume>14</volume>
(<issue>1</issue>
):<fpage>1</fpage>
. <pub-id pub-id-type="pmid">23323973</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref002"><label>2</label>
<mixed-citation publication-type="journal"><name><surname>Crofts</surname>
<given-names>TS</given-names>
</name>
, <name><surname>Gasparrini</surname>
<given-names>AJ</given-names>
</name>
, <name><surname>Dantas</surname>
<given-names>G</given-names>
</name>
. <article-title>Next-generation approaches to understand and combat the antibiotic resistome</article-title>
. <source>Nat Rev Microbiol</source>
 [Internet]. <year>2017</year>
;<volume>15</volume>
(<issue>7</issue>
):<fpage>422</fpage>
–<lpage>34</lpage>
. Available from: <pub-id pub-id-type="doi">10.1038/nrmicro.2017.28</pub-id>
<pub-id pub-id-type="pmid">28392565</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref003"><label>3</label>
<mixed-citation publication-type="journal"><name><surname>Bakour</surname>
<given-names>S</given-names>
</name>
, <name><surname>Sankar</surname>
<given-names>SA</given-names>
</name>
, <name><surname>Rathored</surname>
<given-names>J</given-names>
</name>
, <name><surname>Biagini</surname>
<given-names>P</given-names>
</name>
, <name><surname>Raoult</surname>
<given-names>D</given-names>
</name>
, <name><surname>Fournier</surname>
<given-names>P-E</given-names>
</name>
. <article-title>Identification of virulence factors and antibiotic resistance markers using bacterial genomics</article-title>
. <source>Future Microbiol</source>
 [Internet]. <year>2016</year>
;<volume>11</volume>
(<issue>3</issue>
):<fpage>455</fpage>
–<lpage>66</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.futuremedicine.com/doi/10.2217/fmb.15.149">http://www.futuremedicine.com/doi/10.2217/fmb.15.149</ext-link>
<pub-id pub-id-type="doi">10.2217/fmb.15.149</pub-id>
<pub-id pub-id-type="pmid">26974504</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref004"><label>4</label>
<mixed-citation publication-type="journal"><name><surname>Wheeler</surname>
<given-names>NE</given-names>
</name>
, <name><surname>Gardner</surname>
<given-names>PP</given-names>
</name>
, <name><surname>Barquist</surname>
<given-names>L</given-names>
</name>
. <article-title>Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica</article-title>
. <source>PLOS Genet</source>
 [Internet]. <year>2018</year>
;<volume>14</volume>
(<issue>5</issue>
):<fpage>e1007333</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://dx.plos.org/10.1371/journal.pgen.1007333">http://dx.plos.org/10.1371/journal.pgen.1007333</ext-link>
<pub-id pub-id-type="doi">10.1371/journal.pgen.1007333</pub-id>
<pub-id pub-id-type="pmid">29738521</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref005"><label>5</label>
<mixed-citation publication-type="journal"><name><surname>Li</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Metcalf</surname>
<given-names>BJ</given-names>
</name>
, <name><surname>Chochua</surname>
<given-names>S</given-names>
</name>
, <name><surname>Li</surname>
<given-names>Z</given-names>
</name>
, <name><surname>Gertz</surname>
<given-names>RE</given-names>
</name>
, <name><surname>Walker</surname>
<given-names>H</given-names>
</name>
, <etal>et al</etal>
<article-title>Validation of β-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (PBP) sequences</article-title>
. <source>BMC Genomics</source>
. <year>2017</year>
;<volume>18</volume>
(<issue>1</issue>
):<fpage>1</fpage>
–<lpage>10</lpage>
. <pub-id pub-id-type="doi">10.1186/s12864-016-3406-7</pub-id>
<pub-id pub-id-type="pmid">28049423</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref006"><label>6</label>
<mixed-citation publication-type="journal"><name><surname>Lees</surname>
<given-names>JA</given-names>
</name>
, <name><surname>Vehkala</surname>
<given-names>M</given-names>
</name>
, <name><surname>Välimäki</surname>
<given-names>N</given-names>
</name>
, <name><surname>Harris</surname>
<given-names>SR</given-names>
</name>
, <name><surname>Chewapreecha</surname>
<given-names>C</given-names>
</name>
, <name><surname>Croucher</surname>
<given-names>NJ</given-names>
</name>
, <etal>et al</etal>
<article-title>Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes</article-title>
. <source>Nat Commun</source>
 [Internet]. <year>2016</year>
;<volume>7</volume>
:<fpage>12797</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/doifinder/10.1038/ncomms12797">http://www.nature.com/doifinder/10.1038/ncomms12797</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref007"><label>7</label>
<mixed-citation publication-type="journal"><name><surname>Nguyen</surname>
<given-names>M</given-names>
</name>
, <name><surname>Brettin</surname>
<given-names>T</given-names>
</name>
, <name><surname>Long</surname>
<given-names>SW</given-names>
</name>
, <name><surname>Musser</surname>
<given-names>JM</given-names>
</name>
, <name><surname>Olsen</surname>
<given-names>RJ</given-names>
</name>
, <name><surname>Olson</surname>
<given-names>R</given-names>
</name>
, <etal>et al</etal>
<article-title>Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae</article-title>
. <source>Sci Rep</source>
 [Internet]. <year>2018</year>
;<volume>8</volume>
(<issue>1</issue>
):<fpage>1</fpage>
–<lpage>11</lpage>
. Available from: <pub-id pub-id-type="doi">10.1038/s41598-017-17765-5</pub-id>
<pub-id pub-id-type="pmid">29311619</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref008"><label>8</label>
<mixed-citation publication-type="journal"><name><surname>Davis</surname>
<given-names>JJ</given-names>
</name>
, <name><surname>Boisvert</surname>
<given-names>S</given-names>
</name>
, <name><surname>Brettin</surname>
<given-names>T</given-names>
</name>
, <name><surname>Kenyon</surname>
<given-names>RW</given-names>
</name>
, <name><surname>Mao</surname>
<given-names>C</given-names>
</name>
, <name><surname>Olson</surname>
<given-names>R</given-names>
</name>
, <etal>et al</etal>
<article-title>Antimicrobial Resistance Prediction in PATRIC and RAST</article-title>
. <source>Sci Rep</source>
 [Internet]. <year>2016</year>
;<volume>6</volume>
(<issue>May</issue>
):<fpage>1</fpage>
–<lpage>12</lpage>
. Available from: <pub-id pub-id-type="doi">10.1038/srep27930</pub-id>
<pub-id pub-id-type="pmid">28442746</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref009"><label>9</label>
<mixed-citation publication-type="journal"><name><surname>Drouin</surname>
<given-names>A</given-names>
</name>
, <name><surname>Giguère</surname>
<given-names>S</given-names>
</name>
, <name><surname>Déraspe</surname>
<given-names>M</given-names>
</name>
, <name><surname>Marchand</surname>
<given-names>M</given-names>
</name>
, <name><surname>Tyers</surname>
<given-names>M</given-names>
</name>
, <name><surname>Loo</surname>
<given-names>VG</given-names>
</name>
, <etal>et al</etal>
<article-title>Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons</article-title>
. <source>BMC Genomics</source>
 [Internet]. <year>2016</year>
;<volume>17</volume>
(<issue>1</issue>
):<fpage>754</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6">http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2889-6</ext-link>
<pub-id pub-id-type="doi">10.1186/s12864-016-2889-6</pub-id>
<pub-id pub-id-type="pmid">27671088</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref010"><label>10</label>
<mixed-citation publication-type="journal"><name><surname>Marinier</surname>
<given-names>E</given-names>
</name>
, <name><surname>Zaheer</surname>
<given-names>R</given-names>
</name>
, <name><surname>Berry</surname>
<given-names>C</given-names>
</name>
, <name><surname>Weedmark</surname>
<given-names>KA</given-names>
</name>
, <name><surname>Domaratzki</surname>
<given-names>M</given-names>
</name>
, <name><surname>Mabon</surname>
<given-names>P</given-names>
</name>
, <etal>et al</etal>
<article-title>Neptune: a bioinformatics tool for rapid discovery of genomic variation in bacterial populations</article-title>
. <source>Nucleic Acids Res</source>
 [Internet]. <year>2017</year>
; Available from: <ext-link ext-link-type="uri" xlink:href="http://academic.oup.com/nar/article/doi/10.1093/nar/gkx702/4083563/Neptune-a-bioinformatics-tool-for-rapid-discovery">http://academic.oup.com/nar/article/doi/10.1093/nar/gkx702/4083563/Neptune-a-bioinformatics-tool-for-rapid-discovery</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref011"><label>11</label>
<mixed-citation publication-type="journal"><name><surname>Kaplinski</surname>
<given-names>L</given-names>
</name>
, <name><surname>Lepamets</surname>
<given-names>M</given-names>
</name>
, <name><surname>Remm</surname>
<given-names>M</given-names>
</name>
. <article-title>GenomeTester4: a toolkit for performing basic set operations—union, intersection and complement on k-mer lists</article-title>
. <source>Gigascience</source>
 [Internet]. <year>2015</year>
;<volume>4</volume>
(<issue>1</issue>
):<fpage>58</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="https://academic.oup.com/gigascience/article-lookup/doi/10.1186/s13742-015-0097-y">https://academic.oup.com/gigascience/article-lookup/doi/10.1186/s13742-015-0097-y</ext-link>
<pub-id pub-id-type="pmid">26640690</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref012"><label>12</label>
<mixed-citation publication-type="journal"><name><surname>Ondov</surname>
<given-names>BD</given-names>
</name>
, <name><surname>Treangen</surname>
<given-names>TJ</given-names>
</name>
, <name><surname>Melsted</surname>
<given-names>P</given-names>
</name>
, <name><surname>Mallonee</surname>
<given-names>AB</given-names>
</name>
, <name><surname>Bergman</surname>
<given-names>NH</given-names>
</name>
, <name><surname>Koren</surname>
<given-names>S</given-names>
</name>
, <etal>et al</etal>
<article-title>Mash: fast genome and metagenome distance estimation using MinHash</article-title>
. <source>Genome Biol</source>
 [Internet]. <year>2016</year>
;<volume>17</volume>
(<issue>1</issue>
):<fpage>132</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x</ext-link>
<pub-id pub-id-type="doi">10.1186/s13059-016-0997-x</pub-id>
<pub-id pub-id-type="pmid">27323842</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref013"><label>13</label>
<mixed-citation publication-type="journal"><name><surname>Gerstein</surname>
<given-names>M</given-names>
</name>
, <name><surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
, <name><surname>Chothia</surname>
<given-names>C</given-names>
</name>
. <article-title>Volume changes in protein evolution</article-title>
. <source>J Mol Biol</source>
. <year>1994</year>
;<volume>236</volume>
(<issue>4</issue>
):<fpage>1067</fpage>
–<lpage>78</lpage>
. <pub-id pub-id-type="pmid">8120887</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref014"><label>14</label>
<mixed-citation publication-type="journal"><name><surname>Pajuste</surname>
<given-names>F-D</given-names>
</name>
, <name><surname>Kaplinski</surname>
<given-names>L</given-names>
</name>
, <name><surname>Möls</surname>
<given-names>M</given-names>
</name>
, <name><surname>Puurand</surname>
<given-names>T</given-names>
</name>
, <name><surname>Lepamets</surname>
<given-names>M</given-names>
</name>
, <name><surname>Remm</surname>
<given-names>M</given-names>
</name>
. <article-title>FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads</article-title>
. <source>Sci Rep</source>
 [Internet]. <year>2017</year>
;<volume>7</volume>
(<issue>1</issue>
):<fpage>2537</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/articles/s41598-017-02487-5">http://www.nature.com/articles/s41598-017-02487-5</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref015"><label>15</label>
<mixed-citation publication-type="journal"><name><surname>Barker</surname>
<given-names>KF</given-names>
</name>
. <article-title>Antibiotic resistance: a current perspective</article-title>
. <source>Br J Clin Pharmacol</source>
 [Internet]. <year>1999</year>
<month>8</month>
 [cited 2018 Jun 13];<volume>48</volume>
(<issue>2</issue>
):<fpage>109</fpage>
–<lpage>24</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/10417485">http://www.ncbi.nlm.nih.gov/pubmed/10417485</ext-link>
<pub-id pub-id-type="doi">10.1046/j.1365-2125.1999.00997.x</pub-id>
<pub-id pub-id-type="pmid">10417485</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref016"><label>16</label>
<mixed-citation publication-type="journal"><collab>European Committee on Antimicrobial Susceptibility Testing</collab>
. <article-title>Breakpoint tables for interpretation of MICs and zone diameters</article-title>
 [Internet]. <year>2015</year>
;<fpage>0</fpage>
–<lpage>77</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.eucast.org/fileadmin/src/media/PDFs/EUCAST_files/Breakpoint_tables/v_5.0_Breakpoint_Table_01.pdf">http://www.eucast.org/fileadmin/src/media/PDFs/EUCAST_files/Breakpoint_tables/v_5.0_Breakpoint_Table_01.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref017"><label>17</label>
<mixed-citation publication-type="journal"><name><surname>Fàbrega</surname>
<given-names>A</given-names>
</name>
, <name><surname>Madurga</surname>
<given-names>S</given-names>
</name>
, <name><surname>Giralt</surname>
<given-names>E</given-names>
</name>
, <name><surname>Vila</surname>
<given-names>J</given-names>
</name>
. <article-title>Mechanism of action of and resistance to quinolones</article-title>
. <source>Microb Biotechnol</source>
. <year>2009</year>
;<volume>2</volume>
(<issue>1</issue>
):<fpage>40</fpage>
–<lpage>61</lpage>
. <pub-id pub-id-type="doi">10.1111/j.1751-7915.2008.00063.x</pub-id>
<pub-id pub-id-type="pmid">21261881</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref018"><label>18</label>
<mixed-citation publication-type="journal"><name><surname>Jalal</surname>
<given-names>S</given-names>
</name>
, <name><surname>Wretlind</surname>
<given-names>B</given-names>
</name>
. <article-title>Mechanisms of quinolone resistance in clinical strains of Pseudomonas aeruginosa</article-title>
. <source>Microb Drug Resist</source>
 [Internet]. <year>1998</year>
;<volume>4</volume>
(<issue>4</issue>
):<fpage>257</fpage>
–<lpage>61</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/9988043">http://www.ncbi.nlm.nih.gov/pubmed/9988043</ext-link>
<pub-id pub-id-type="doi">10.1089/mdr.1998.4.257</pub-id>
<pub-id pub-id-type="pmid">9988043</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref019"><label>19</label>
<mixed-citation publication-type="journal"><name><surname>Kaminska</surname>
<given-names>KH</given-names>
</name>
, <name><surname>Purta</surname>
<given-names>E</given-names>
</name>
, <name><surname>Hansen</surname>
<given-names>LH</given-names>
</name>
, <name><surname>Bujnicki</surname>
<given-names>JM</given-names>
</name>
, <name><surname>Vester</surname>
<given-names>B</given-names>
</name>
, <name><surname>Long</surname>
<given-names>KS</given-names>
</name>
. <article-title>Insights into the structure, function and evolution of the radical-SAM 23S rRNA methyltransferase Cfr that confers antibiotic resistance in bacteria</article-title>
. <source>Nucleic Acids Res</source>
. <year>2009</year>
;<volume>38</volume>
(<issue>5</issue>
):<fpage>1652</fpage>
–<lpage>63</lpage>
. <pub-id pub-id-type="doi">10.1093/nar/gkp1142</pub-id>
<pub-id pub-id-type="pmid">20007606</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref020"><label>20</label>
<mixed-citation publication-type="journal"><name><surname>Carniel</surname>
<given-names>E</given-names>
</name>
. <article-title>The Yersinia high-pathogenicity island: An iron-uptake island</article-title>
. <source>Microbes Infect</source>
. <year>2001</year>
;<volume>3</volume>
(<issue>7</issue>
):<fpage>561</fpage>
–<lpage>9</lpage>
. <pub-id pub-id-type="pmid">11418330</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref021"><label>21</label>
<mixed-citation publication-type="journal"><name><surname>Chen</surname>
<given-names>YT</given-names>
</name>
, <name><surname>Chang</surname>
<given-names>HY</given-names>
</name>
, <name><surname>Lai</surname>
<given-names>YC</given-names>
</name>
, <name><surname>Pan</surname>
<given-names>CC</given-names>
</name>
, <name><surname>Tsai</surname>
<given-names>SF</given-names>
</name>
, <name><surname>Peng</surname>
<given-names>HL</given-names>
</name>
. <article-title>Sequencing and analysis of the large virulence plasmid pLVPK of Klebsiella pneumoniae CG43</article-title>
. <source>Gene</source>
. <year>2004</year>
;<volume>337</volume>
(<issue>1–2</issue>
):<fpage>189</fpage>
–<lpage>98</lpage>
.<pub-id pub-id-type="pmid">15276215</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref022"><label>22</label>
<mixed-citation publication-type="journal"><name><surname>Lagos</surname>
<given-names>R</given-names>
</name>
, <name><surname>Baeza</surname>
<given-names>M</given-names>
</name>
, <name><surname>Corsini</surname>
<given-names>G</given-names>
</name>
, <name><surname>Hetz</surname>
<given-names>C</given-names>
</name>
, <name><surname>Strahsburger</surname>
<given-names>E</given-names>
</name>
, <name><surname>Castillo</surname>
<given-names>JA</given-names>
</name>
, <etal>et al</etal>
<article-title>Structure, organization and characterization of the gene cluster involved in the production of microcin E492, a channel-forming bacteriocin</article-title>
. <source>Mol Microbiol</source>
. <year>2001</year>
;<volume>42</volume>
(<issue>1</issue>
):<fpage>229</fpage>
–<lpage>43</lpage>
. <pub-id pub-id-type="pmid">11679081</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref023"><label>23</label>
<mixed-citation publication-type="journal"><name><surname>Nassif</surname>
<given-names>X</given-names>
</name>
, <name><surname>Sansonetti</surname>
<given-names>PJ</given-names>
</name>
. <article-title>Correlation of the virulence of Klebsiella pneumoniae K1 and K2 with the presence of a plasmid encoding aerobactin</article-title>
. <source>Infect Immun</source>
. <year>1986</year>
;<volume>54</volume>
(<issue>3</issue>
):<fpage>603</fpage>
–<lpage>8</lpage>
. <pub-id pub-id-type="pmid">2946641</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref024"><label>24</label>
<mixed-citation publication-type="journal"><name><surname>Putze</surname>
<given-names>J</given-names>
</name>
, <name><surname>Hennequin</surname>
<given-names>C</given-names>
</name>
, <name><surname>Nougayrède</surname>
<given-names>JP</given-names>
</name>
, <name><surname>Zhang</surname>
<given-names>W</given-names>
</name>
, <name><surname>Homburg</surname>
<given-names>S</given-names>
</name>
, <name><surname>Karch</surname>
<given-names>H</given-names>
</name>
, <etal>et al</etal>
<article-title>Genetic structure and distribution of the colibactin genomic island among members of the family Enterobacteriaceae</article-title>
. <source>Infect Immun</source>
. <year>2009</year>
;<volume>77</volume>
(<issue>11</issue>
):<fpage>4696</fpage>
–<lpage>703</lpage>
. <pub-id pub-id-type="doi">10.1128/IAI.00522-09</pub-id>
<pub-id pub-id-type="pmid">19720753</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref025"><label>25</label>
<mixed-citation publication-type="journal"><name><surname>Chou</surname>
<given-names>HC</given-names>
</name>
, <name><surname>Lee</surname>
<given-names>CZ</given-names>
</name>
, <name><surname>Ma</surname>
<given-names>LC</given-names>
</name>
, <name><surname>Fang</surname>
<given-names>CT</given-names>
</name>
, <name><surname>Chang</surname>
<given-names>SC</given-names>
</name>
, <name><surname>Wang</surname>
<given-names>JT</given-names>
</name>
. <article-title>Isolation of a chromosomal region of <italic>Klebsiella pneumoniae</italic>
 associated with allantoin metabolism and liver infection</article-title>
. <source>Infect Immun</source>
. <year>2004</year>
;<volume>72</volume>
(<issue>7</issue>
):<fpage>3783</fpage>
–<lpage>92</lpage>
. <pub-id pub-id-type="doi">10.1128/IAI.72.7.3783-3792.2004</pub-id>
<pub-id pub-id-type="pmid">15213119</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref026"><label>26</label>
<mixed-citation publication-type="journal"><name><surname>Cheng</surname>
<given-names>HY</given-names>
</name>
, <name><surname>Chen</surname>
<given-names>YS</given-names>
</name>
, <name><surname>Wu</surname>
<given-names>CY</given-names>
</name>
, <name><surname>Chang</surname>
<given-names>HY</given-names>
</name>
, <name><surname>Lai</surname>
<given-names>YC</given-names>
</name>
, <name><surname>Peng</surname>
<given-names>HL</given-names>
</name>
. <article-title>RmpA regulation of capsular polysaccharide biosynthesis in Klebsiella pneumoniae CG43</article-title>
. <source>J Bacteriol</source>
. <year>2010</year>
;<volume>192</volume>
(<issue>12</issue>
):<fpage>3144</fpage>
–<lpage>58</lpage>
. <pub-id pub-id-type="doi">10.1128/JB.00031-10</pub-id>
<pub-id pub-id-type="pmid">20382770</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref027"><label>27</label>
<mixed-citation publication-type="journal"><name><surname>Lai</surname>
<given-names>Y</given-names>
</name>
, <name><surname>Peng</surname>
<given-names>H</given-names>
</name>
, <name><surname>Chang</surname>
<given-names>H</given-names>
</name>
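. <article-title>RmpA2, an activator of capsule biosynthesis in <italic>Klebsiella pneumoniae</italic> CG43, regulates K2 <italic>cps</italic> gene expression at the transcriptional level</article-title>
. <source>J Bacteriol</source>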
. <year>2003</year>
;<volume>185</volume>
(<issue>3</issue>
):<fpage>788</fpage>
–<lpage>800</lpage>
.</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref028"><label>28</label>
<mixed-citation publication-type="journal"><name><surname>Ma</surname>
<given-names>L-C</given-names>
</name>
, <name><surname>Fang</surname>
<given-names>C-T</given-names>
</name>
, <name><surname>Lee</surname>
<given-names>C-Z</given-names>
</name>
, <name><surname>Shun</surname>
<given-names>C-T</given-names>
</name>
, <name><surname>Wang</surname>
<given-names>J-T</given-names>
</name>
. <article-title>Genomic heterogeneity in Klebsiella pneumoniae strains is associated with primary pyogenic liver abscess and metastatic infection</article-title>
. <source>J Infect Dis</source>
 [Internet]. <year>2005</year>
;<volume>192</volume>
(<issue>1</issue>
):<fpage>117</fpage>
–<lpage>28</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/15942901">http://www.ncbi.nlm.nih.gov/pubmed/15942901</ext-link>
<pub-id pub-id-type="doi">10.1086/430619</pub-id>
<pub-id pub-id-type="pmid">15942901</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref029"><label>29</label>
<mixed-citation publication-type="journal"><name><surname>Lai</surname>
<given-names>YC</given-names>
</name>
, <name><surname>Lin</surname>
<given-names>G-T</given-names>
</name>
, <name><surname>Yang</surname>
<given-names>S-L</given-names>
</name>
, <name><surname>Chang</surname>
<given-names>H-Y</given-names>
</name>
, <name><surname>Peng</surname>
<given-names>H-L</given-names>
</name>
. <article-title>Identification and characterization of KvgAS, a two-component system in <italic>Klebsiella pneumoniae</italic>
 CG43</article-title>
. <source>FEMS Microbiol Lett</source>
 [Internet]. <year>2003</year>
;<volume>218</volume>
(<issue>1</issue>
):<fpage>121</fpage>
–<lpage>6</lpage>
.
 Available from: <ext-link ext-link-type="uri" xlink:href="https://academic.oup.com/femsle/article-lookup/doi/10.1111/j.15746968.2003.tb11507.x">https://academic.oup.com/femsle/article-lookup/doi/10.1111/j.15746968.2003.tb11507.x</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref030"><label>30</label>
<mixed-citation publication-type="journal"><name><surname>Bankevich</surname>
<given-names>A</given-names>
</name>
, <name><surname>Nurk</surname>
<given-names>S</given-names>
</name>
, <name><surname>Antipov</surname>
<given-names>D</given-names>
</name>
, <name><surname>Gurevich</surname>
<given-names>AA</given-names>
</name>
, <name><surname>Dvorkin</surname>
<given-names>M</given-names>
</name>
, <name><surname>Kulikov</surname>
<given-names>AS</given-names>
</name>
, <etal>et al</etal>
<article-title>SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing</article-title>
. <source>J Comput Biol</source>
 [Internet]. <year>2012</year>
;<volume>19</volume>
(<issue>5</issue>
):<fpage>455</fpage>
–<lpage>77</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021">http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021</ext-link>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0021</pub-id>
<pub-id pub-id-type="pmid">22506599</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref031"><label>31</label>
<mixed-citation publication-type="journal"><name><surname>Holt</surname>
<given-names>KE</given-names>
</name>
, <name><surname>Wertheim</surname>
<given-names>H</given-names>
</name>
, <name><surname>Zadoks</surname>
<given-names>RN</given-names>
</name>
, <name><surname>Baker</surname>
<given-names>S</given-names>
</name>
, <name><surname>Whitehouse</surname>
<given-names>CA</given-names>
</name>
, <name><surname>Dance</surname>
<given-names>D</given-names>
</name>
, <etal>et al</etal>
<article-title>Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in <italic>Klebsiella pneumoniae</italic>
, an urgent threat to public health</article-title>
. <source>Proc Natl Acad Sci</source>
 [Internet]. <year>2015</year>
;<volume>112</volume>
(<issue>27</issue>
):<fpage>E3574</fpage>
–<lpage>81</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112">http://www.pnas.org/lookup/doi/10.1073/pnas.1501049112</ext-link>
<pub-id pub-id-type="doi">10.1073/pnas.1501049112</pub-id>
<pub-id pub-id-type="pmid">26100894</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref032"><label>32</label>
<mixed-citation publication-type="journal"><name><surname>Knight</surname>
<given-names>R</given-names>
</name>
, <name><surname>Maxwell</surname>
<given-names>P</given-names>
</name>
, <name><surname>Birmingham</surname>
<given-names>A</given-names>
</name>
, <name><surname>Carnes</surname>
<given-names>J</given-names>
</name>
, <name><surname>Caporaso</surname>
<given-names>JG</given-names>
</name>
, <name><surname>Easton</surname>
<given-names>BC</given-names>
</name>
, <etal>et al</etal>
<article-title>PyCogent: a toolkit for making sense from sequence</article-title>
. <source>Genome Biol</source>
 [Internet]. <year>2007</year>
;<volume>8</volume>
(<issue>8</issue>
):<fpage>R171</fpage>
 Available from: <ext-link ext-link-type="uri" xlink:href="http://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-8-r171">http://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-8-r171</ext-link>
<pub-id pub-id-type="doi">10.1186/gb-2007-8-8-r171</pub-id>
<pub-id pub-id-type="pmid">17708774</pub-id>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref033"><label>33</label>
<mixed-citation publication-type="other">SciPy Community. SciPy Reference Guide 0.16.0. 2013;1229.</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref034"><label>34</label>
<mixed-citation publication-type="journal"><name><surname>Pasek</surname>
<given-names>J</given-names>
</name>
. <article-title>Package “weights”, with some assistance from Alex Tahk and some code modified from R-core; additional contributions by Gene Culter and M. Schwemmle</article-title>
. <year>2016</year>
; Available from: <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/web/packages/weights/weights.pdf">https://cran.r-project.org/web/packages/weights/weights.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="pcbi.1006434.ref035"><label>35</label>
<mixed-citation publication-type="journal"><name><surname>Pedregosa</surname>
<given-names>F</given-names>
</name>
, <name><surname>Varoquaux</surname>
<given-names>G</given-names>
</name>
, <name><surname>Gramfort</surname>
<given-names>A</given-names>
</name>
, <name><surname>Michel</surname>
<given-names>V</given-names>
</name>
, <name><surname>Thirion</surname>
<given-names>B</given-names>
</name>
, <name><surname>Grisel</surname>
<given-names>O</given-names>
</name>
, <etal>et al</etal>
<article-title>Scikit-learn: Machine Learning in Python</article-title>
. <source>J Mach Learn Res</source>
 [Internet]. <year>2011</year>
;<volume>12</volume>
:<fpage>2825</fpage>
–<lpage>2830</lpage>
. Available from: <ext-link ext-link-type="uri" xlink:href="http://dl.acm.org/citation.cfm?id=1953048.2078195">http://dl.acm.org/citation.cfm?id=1953048.2078195</ext-link>
<ext-link ext-link-type="uri" xlink:href="http://dl.acm.org/ft_gateway.cfm?id=2078195&type=pdf">http://dl.acm.org/ft_gateway.cfm?id=2078195&type=pdf</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F87 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F87 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:6211763
   |texte=   A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:30346947" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
