Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection

Internal identifier: 000276 (Pmc/Corpus); previous: 000275; next: 000277

Authors: Pierre Mahé; Maud Tournoud

Source:

BMC Bioinformatics [1471-2105]; 2018.

RBID : PMC:6192184

Abstract

Background

Several studies have demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences. The prediction process usually amounts to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations within these genes, previously identified from a training panel of strains. We address the problem from a supervised statistical learning perspective, without relying on prior information about such resistance factors. We rely on a k-mer based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–73, 2010), which consists of penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures.
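The idea of coupling a Lasso-penalized logistic regression with resampling can be illustrated with a minimal sketch. It is not the authors' implementation: the proximal-gradient solver, the penalty value `lam`, the number of half-samples, the 0.8 selection cutoff, and the toy presence/absence matrix are all assumptions chosen only to keep the example self-contained.

```python
import math
import random


def fit_l1_logistic(X, y, lam, n_iter=300, step=0.5):
    """Fit Lasso-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b  # the intercept is not penalized
        for j in range(p):
            v = w[j] - step * grad_w[j]
            # Soft-thresholding: proximal step of the L1 penalty (exact zeros).
            w[j] = math.copysign(max(abs(v) - step * lam, 0.0), v)
    return w, b


def selection_frequencies(X, y, lam, n_resamples=30, seed=0):
    """Fraction of random half-samples in which each feature gets a nonzero weight."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_resamples):
        idx = rng.sample(range(n), n // 2)
        w, _ = fit_l1_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            counts[j] += w[j] != 0.0
    return [c / n_resamples for c in counts]


# Toy data (hypothetical): 40 strains, 6 binary k-mer presence features.
# Feature 0 fully determines the phenotype; features 1-5 are unrelated to it.
X = [[(i >> j) & 1 for j in range(6)] for i in range(40)]
y = [row[0] for row in X]

freqs = selection_frequencies(X, y, lam=0.1)
stable = [j for j, f in enumerate(freqs) if f >= 0.8]  # 0.8 is an arbitrary cutoff
print(freqs, stable)
```

Features that keep a nonzero weight across most half-samples are retained. In this toy setting only the informative feature is expected to pass the cutoff.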

Results

Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to the state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, and are therefore remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies).
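To see why such a sparse model is cheap to evaluate, here is a hedged sketch of scoring a new genome. Everything in it is hypothetical: the k-mers, weights, and intercept are made-up values, and testing presence by substring search on either strand is an assumption rather than the genotyping scheme used in the paper.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_present(kmer, sequences):
    """True if the k-mer, or its reverse complement, occurs in any read or contig."""
    rc = reverse_complement(kmer)
    return any(kmer in s or rc in s for s in sequences)


def resistance_probability(sequences, weights, intercept):
    """Logistic model over binary k-mer presence indicators."""
    z = intercept + sum(w for kmer, w in weights.items()
                        if kmer_present(kmer, sequences))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model with two k-mers (the abstract reports 1 to 8 per antibiotic).
weights = {"ACGTAC": 3.0, "GGGTTT": 1.5}
intercept = -2.0

p_hit = resistance_probability(["TTGTACGTAA"], weights, intercept)  # carries ACGTAC on the reverse strand
p_miss = resistance_probability(["CCCCCCCC"], weights, intercept)   # carries neither k-mer
print(round(p_hit, 3), round(p_miss, 3))
```

Because only a handful of k-mers need to be looked up, the same function applies unchanged to raw reads or to assembled contigs.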

Conclusion

Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2403-z) contains supplementary material, which is available to authorized users.

URL:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192184

DOI: 10.1186/s12859-018-2403-z
PubMed: 30332990
PubMed Central: 6192184

Links to Exploration step

PMC:6192184

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Predicting bacterial resistance from whole-genome sequences using <italic>k</italic>-mers and stability selection</title>
<author><name sortKey="Mahe, Pierre" sort="Mahe, Pierre" uniqKey="Mahe P" first="Pierre" last="Mahé">Pierre Mahé</name>
<affiliation><nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tournoud, Maud" sort="Tournoud, Maud" uniqKey="Tournoud M" first="Maud" last="Tournoud">Maud Tournoud</name>
<affiliation><nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">30332990</idno>
<idno type="pmc">6192184</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6192184</idno>
<idno type="RBID">PMC:6192184</idno>
<idno type="doi">10.1186/s12859-018-2403-z</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000276</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000276</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Predicting bacterial resistance from whole-genome sequences using <italic>k</italic>-mers and stability selection</title>
<author><name sortKey="Mahe, Pierre" sort="Mahe, Pierre" uniqKey="Mahe P" first="Pierre" last="Mahé">Pierre Mahé</name>
<affiliation><nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tournoud, Maud" sort="Tournoud, Maud" uniqKey="Tournoud M" first="Maud" last="Tournoud">Maud Tournoud</name>
<affiliation><nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Several studies have demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences. The prediction process usually amounts to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations within these genes, previously identified from a training panel of strains. We address the problem from a supervised statistical learning perspective, without relying on prior information about such resistance factors. We rely on a <italic>k</italic>-mer based genotyping scheme and a logistic regression model, thereby combining several <italic>k</italic>-mers into a probabilistic model. To identify a small yet predictive set of <italic>k</italic>-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–73, 2010), which consists of penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures.</p>
</sec>
<sec><title>Results</title>
<p>Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to the state of the art. The models are extremely sparse, involving 1 to 8 <italic>k</italic>-mers per antibiotic, and are therefore remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies).</p>
</sec>
<sec><title>Conclusion</title>
<p>Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2403-z) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Loman, Nj" uniqKey="Loman N">NJ Loman</name>
</author>
<author><name sortKey="Constantinidou, C" uniqKey="Constantinidou C">C Constantinidou</name>
</author>
<author><name sortKey="Chan, Jz" uniqKey="Chan J">JZ Chan</name>
</author>
<author><name sortKey="Halachev, M" uniqKey="Halachev M">M Halachev</name>
</author>
<author><name sortKey="Sergeant, M" uniqKey="Sergeant M">M Sergeant</name>
</author>
<author><name sortKey="Penn, Cw" uniqKey="Penn C">CW Penn</name>
</author>
<author><name sortKey="Robinson, Er" uniqKey="Robinson E">ER Robinson</name>
</author>
<author><name sortKey="Pallen, Mj" uniqKey="Pallen M">MJ Pallen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chan, Jacqueline Z M" uniqKey="Chan J">Jacqueline Z-M Chan</name>
</author>
<author><name sortKey="Pallen, Mark J" uniqKey="Pallen M">Mark J Pallen</name>
</author>
<author><name sortKey="Oppenheim, Beryl" uniqKey="Oppenheim B">Beryl Oppenheim</name>
</author>
<author><name sortKey="Constantinidou, Chrystala" uniqKey="Constantinidou C">Chrystala Constantinidou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bertelli, C" uniqKey="Bertelli C">C. Bertelli</name>
</author>
<author><name sortKey="Greub, G" uniqKey="Greub G">G. Greub</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Didelot, Xavier" uniqKey="Didelot X">Xavier Didelot</name>
</author>
<author><name sortKey="Bowden, Rory" uniqKey="Bowden R">Rory Bowden</name>
</author>
<author><name sortKey="Wilson, Daniel J" uniqKey="Wilson D">Daniel J. Wilson</name>
</author>
<author><name sortKey="Peto, Tim E A" uniqKey="Peto T">Tim E. A. Peto</name>
</author>
<author><name sortKey="Crook, Derrick W" uniqKey="Crook D">Derrick W. Crook</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bergmiller, T" uniqKey="Bergmiller T">T Bergmiller</name>
</author>
<author><name sortKey="Andersson, Am" uniqKey="Andersson A">AM Andersson</name>
</author>
<author><name sortKey="Tomasek, K" uniqKey="Tomasek K">K Tomasek</name>
</author>
<author><name sortKey="Balleza, E" uniqKey="Balleza E">E Balleza</name>
</author>
<author><name sortKey="Kiviet, Dj" uniqKey="Kiviet D">DJ Kiviet</name>
</author>
<author><name sortKey="Hauschild, R" uniqKey="Hauschild R">R Hauschild</name>
</author>
<author><name sortKey="Tka Ik, G" uniqKey="Tka Ik G">G Tkačik</name>
</author>
<author><name sortKey="Guet, Cc" uniqKey="Guet C">CC Guet</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gordon, Nc" uniqKey="Gordon N">NC Gordon</name>
</author>
<author><name sortKey="Price, Jr" uniqKey="Price J">JR Price</name>
</author>
<author><name sortKey="Cole, K" uniqKey="Cole K">K Cole</name>
</author>
<author><name sortKey="Everitt, R" uniqKey="Everitt R">R Everitt</name>
</author>
<author><name sortKey="Morgan, M" uniqKey="Morgan M">M Morgan</name>
</author>
<author><name sortKey="Finney, F" uniqKey="Finney F">F Finney</name>
</author>
<author><name sortKey="Kearns, Am" uniqKey="Kearns A">AM Kearns</name>
</author>
<author><name sortKey="Pichon, B" uniqKey="Pichon B">B Pichon</name>
</author>
<author><name sortKey="Young, B" uniqKey="Young B">B Young</name>
</author>
<author><name sortKey="Wilson, Dj" uniqKey="Wilson D">DJ Wilson</name>
</author>
<author><name sortKey="Llewelyn, Mj" uniqKey="Llewelyn M">MJ Llewelyn</name>
</author>
<author><name sortKey="Paul, J" uniqKey="Paul J">J Paul</name>
</author>
<author><name sortKey="Peto, Tea" uniqKey="Peto T">TEA Peto</name>
</author>
<author><name sortKey="Crook, D" uniqKey="Crook D">D Crook</name>
</author>
<author><name sortKey="Walker, As" uniqKey="Walker A">AS Walker</name>
</author>
<author><name sortKey="Golubchika, T" uniqKey="Golubchika T">T Golubchika</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bradley, P" uniqKey="Bradley P">P Bradley</name>
</author>
<author><name sortKey="Gordon, Nc" uniqKey="Gordon N">NC Gordon</name>
</author>
<author><name sortKey="Walker, Tm" uniqKey="Walker T">TM Walker</name>
</author>
<author><name sortKey="Dunn, L" uniqKey="Dunn L">L Dunn</name>
</author>
<author><name sortKey="Heys, S" uniqKey="Heys S">S Heys</name>
</author>
<author><name sortKey="Huang, B" uniqKey="Huang B">B Huang</name>
</author>
<author><name sortKey="Earle, S" uniqKey="Earle S">S Earle</name>
</author>
<author><name sortKey="Pankhurst, L" uniqKey="Pankhurst L">L Pankhurst</name>
</author>
<author><name sortKey="Anson, L" uniqKey="Anson L">L Anson</name>
</author>
<author><name sortKey="De Cesare, M" uniqKey="De Cesare M">M de Cesare</name>
</author>
<author><name sortKey="Piazza, P" uniqKey="Piazza P">P Piazza</name>
</author>
<author><name sortKey="Votintseva, Aa" uniqKey="Votintseva A">AA Votintseva</name>
</author>
<author><name sortKey="Golubchik, T" uniqKey="Golubchik T">T Golubchik</name>
</author>
<author><name sortKey="Wilson, Dj" uniqKey="Wilson D">DJ Wilson</name>
</author>
<author><name sortKey="Wyllie, Dh" uniqKey="Wyllie D">DH Wyllie</name>
</author>
<author><name sortKey="Diel, R" uniqKey="Diel R">R Diel</name>
</author>
<author><name sortKey="Niemann, S" uniqKey="Niemann S">S Niemann</name>
</author>
<author><name sortKey="Feuerriegel, S" uniqKey="Feuerriegel S">S Feuerriegel</name>
</author>
<author><name sortKey="Kohl, Ta" uniqKey="Kohl T">TA Kohl</name>
</author>
<author><name sortKey="Ismail, N" uniqKey="Ismail N">N Ismail</name>
</author>
<author><name sortKey="Omar, Sv" uniqKey="Omar S">SV Omar</name>
</author>
<author><name sortKey="Smith, Eg" uniqKey="Smith E">EG Smith</name>
</author>
<author><name sortKey="Buck, D" uniqKey="Buck D">D Buck</name>
</author>
<author><name sortKey="Mcvean, G" uniqKey="Mcvean G">G McVean</name>
</author>
<author><name sortKey="Walker, As" uniqKey="Walker A">AS Walker</name>
</author>
<author><name sortKey="Peto, T" uniqKey="Peto T">T Peto</name>
</author>
<author><name sortKey="Crook, D" uniqKey="Crook D">D Crook</name>
</author>
<author><name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z Iqbal</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Walker, Tm" uniqKey="Walker T">TM Walker</name>
</author>
<author><name sortKey="Kohl, Ta" uniqKey="Kohl T">TA Kohl</name>
</author>
<author><name sortKey="Omar, Sv" uniqKey="Omar S">SV Omar</name>
</author>
<author><name sortKey="Hedge, J" uniqKey="Hedge J">J Hedge</name>
</author>
<author><name sortKey="Elias, Cdo" uniqKey="Elias C">CDO Elias</name>
</author>
<author><name sortKey="Bradley, P" uniqKey="Bradley P">P Bradley</name>
</author>
<author><name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z Iqbal</name>
</author>
<author><name sortKey="Feuerriegel, S" uniqKey="Feuerriegel S">S Feuerriegel</name>
</author>
<author><name sortKey="Niehaus, Ke" uniqKey="Niehaus K">KE Niehaus</name>
</author>
<author><name sortKey="Wilson, Dj" uniqKey="Wilson D">DJ Wilson</name>
</author>
<author><name sortKey="Clifton, Da" uniqKey="Clifton D">DA Clifton</name>
</author>
<author><name sortKey="Kapatai, G" uniqKey="Kapatai G">G Kapatai</name>
</author>
<author><name sortKey="Ip, Clc" uniqKey="Ip C">CLC Ip</name>
</author>
<author><name sortKey="Bowden, R" uniqKey="Bowden R">R Bowden</name>
</author>
<author><name sortKey="Drobniewski, Fa" uniqKey="Drobniewski F">FA Drobniewski</name>
</author>
<author><name sortKey="Allix Beguec, C" uniqKey="Allix Beguec C">C Allix-Béguec</name>
</author>
<author><name sortKey="Gaudin, C" uniqKey="Gaudin C">C Gaudin</name>
</author>
<author><name sortKey="Parkhill, J" uniqKey="Parkhill J">J Parkhill</name>
</author>
<author><name sortKey="Diel, R" uniqKey="Diel R">R Diel</name>
</author>
<author><name sortKey="Supply, P" uniqKey="Supply P">P Supply</name>
</author>
<author><name sortKey="Crook, D" uniqKey="Crook D">D Crook</name>
</author>
<author><name sortKey="Smith, Eg" uniqKey="Smith E">EG Smith</name>
</author>
<author><name sortKey="Walker, As" uniqKey="Walker A">AS Walker</name>
</author>
<author><name sortKey="Ismail, N" uniqKey="Ismail N">N Ismail</name>
</author>
<author><name sortKey="Niemann, S" uniqKey="Niemann S">S Niemann</name>
</author>
<author><name sortKey="Peto, Tea" uniqKey="Peto T">TEA Peto</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Coll, F" uniqKey="Coll F">F Coll</name>
</author>
<author><name sortKey="Mcnerney, R" uniqKey="Mcnerney R">R McNerney</name>
</author>
<author><name sortKey="Preston, Md" uniqKey="Preston M">MD Preston</name>
</author>
<author><name sortKey="Guerra Assuncao, Ja" uniqKey="Guerra Assuncao J">JA Guerra-Assunção</name>
</author>
<author><name sortKey="Warry, A" uniqKey="Warry A">A Warry</name>
</author>
<author><name sortKey="Hill Cawthorne, G" uniqKey="Hill Cawthorne G">G Hill-Cawthorne</name>
</author>
<author><name sortKey="Mallard, K" uniqKey="Mallard K">K Mallard</name>
</author>
<author><name sortKey="Nair, M" uniqKey="Nair M">M Nair</name>
</author>
<author><name sortKey="Miranda, A" uniqKey="Miranda A">A Miranda</name>
</author>
<author><name sortKey="Alves, A" uniqKey="Alves A">A Alves</name>
</author>
<author><name sortKey="Perdigao, J" uniqKey="Perdigao J">J Perdigão</name>
</author>
<author><name sortKey="Viveiros, M" uniqKey="Viveiros M">M Viveiros</name>
</author>
<author><name sortKey="Portugal, I" uniqKey="Portugal I">I Portugal</name>
</author>
<author><name sortKey="Hasan, Z" uniqKey="Hasan Z">Z Hasan</name>
</author>
<author><name sortKey="Hasan, R" uniqKey="Hasan R">R Hasan</name>
</author>
<author><name sortKey="Glynn, Jr" uniqKey="Glynn J">JR Glynn</name>
</author>
<author><name sortKey="Martin, N" uniqKey="Martin N">N Martin</name>
</author>
<author><name sortKey="Pain, A" uniqKey="Pain A">A Pain</name>
</author>
<author><name sortKey="Clark, Tg" uniqKey="Clark T">TG Clark</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Palomino, Jc" uniqKey="Palomino J">JC Palomino</name>
</author>
<author><name sortKey="Martin, A" uniqKey="Martin A">A Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author><name sortKey="Yew, Ww" uniqKey="Yew W">WW Yew</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author><name sortKey="Li, D" uniqKey="Li D">D Li</name>
</author>
<author><name sortKey="Zhao, L" uniqKey="Zhao L">L Zhao</name>
</author>
<author><name sortKey="Fleming, J" uniqKey="Fleming J">J Fleming</name>
</author>
<author><name sortKey="Lin, N" uniqKey="Lin N">N Lin</name>
</author>
<author><name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
<author><name sortKey="Liu, Z" uniqKey="Liu Z">Z Liu</name>
</author>
<author><name sortKey="Li, C" uniqKey="Li C">C Li</name>
</author>
<author><name sortKey="Galwey, N" uniqKey="Galwey N">N Galwey</name>
</author>
<author><name sortKey="Deng, J" uniqKey="Deng J">J Deng</name>
</author>
<author><name sortKey="Zhou, Y" uniqKey="Zhou Y">Y Zhou</name>
</author>
<author><name sortKey="Zhu, Y" uniqKey="Zhu Y">Y Zhu</name>
</author>
<author><name sortKey="Gao, Y" uniqKey="Gao Y">Y Gao</name>
</author>
<author><name sortKey="Wang, T" uniqKey="Wang T">T Wang</name>
</author>
<author><name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
<author><name sortKey="Huang, Y" uniqKey="Huang Y">Y Huang</name>
</author>
<author><name sortKey="Wang, M" uniqKey="Wang M">M Wang</name>
</author>
<author><name sortKey="Zhong, Q" uniqKey="Zhong Q">Q Zhong</name>
</author>
<author><name sortKey="Zhou, L" uniqKey="Zhou L">L Zhou</name>
</author>
<author><name sortKey="Chen, T" uniqKey="Chen T">T Chen</name>
</author>
<author><name sortKey="Zhou, J" uniqKey="Zhou J">J Zhou</name>
</author>
<author><name sortKey="Yang, R" uniqKey="Yang R">R Yang</name>
</author>
<author><name sortKey="Zhu, G" uniqKey="Zhu G">G Zhu</name>
</author>
<author><name sortKey="Hang, H" uniqKey="Hang H">H Hang</name>
</author>
<author><name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author><name sortKey="Li, F" uniqKey="Li F">F Li</name>
</author>
<author><name sortKey="Wan, K" uniqKey="Wan K">K Wan</name>
</author>
<author><name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author><name sortKey="Zhang, X E" uniqKey="Zhang X">X-E Zhang</name>
</author>
<author><name sortKey="Bi, L" uniqKey="Bi L">L Bi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Palmer, Ac" uniqKey="Palmer A">AC Palmer</name>
</author>
<author><name sortKey="Kishony, R" uniqKey="Kishony R">R Kishony</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lees, Ja" uniqKey="Lees J">JA Lees</name>
</author>
<author><name sortKey="Vehkala, M" uniqKey="Vehkala M">M Vehkala</name>
</author>
<author><name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N Välimäki</name>
</author>
<author><name sortKey="Harris, Sr" uniqKey="Harris S">SR Harris</name>
</author>
<author><name sortKey="Chewapreecha, C" uniqKey="Chewapreecha C">C Chewapreecha</name>
</author>
<author><name sortKey="Croucher, Nj" uniqKey="Croucher N">NJ Croucher</name>
</author>
<author><name sortKey="Marttinen, P" uniqKey="Marttinen P">P Marttinen</name>
</author>
<author><name sortKey="Honkela, A" uniqKey="Honkela A">A Honkela</name>
</author>
<author><name sortKey="Parkhill, J" uniqKey="Parkhill J">J Parkhill</name>
</author>
<author><name sortKey="Bentley, Sd" uniqKey="Bentley S">SD Bentley</name>
</author>
<author><name sortKey="Corander, J" uniqKey="Corander J">J Corander</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Earle, Sg" uniqKey="Earle S">SG Earle</name>
</author>
<author><name sortKey="Wu, C H" uniqKey="Wu C">C-H Wu</name>
</author>
<author><name sortKey="Charlesworth, J" uniqKey="Charlesworth J">J Charlesworth</name>
</author>
<author><name sortKey="Stoesser, N" uniqKey="Stoesser N">N Stoesser</name>
</author>
<author><name sortKey="Gordon, Nc" uniqKey="Gordon N">NC Gordon</name>
</author>
<author><name sortKey="Walker, Tm" uniqKey="Walker T">TM Walker</name>
</author>
<author><name sortKey="Spencer, Cca" uniqKey="Spencer C">CCA Spencer</name>
</author>
<author><name sortKey="Iqbal, Z" uniqKey="Iqbal Z">Z Iqbal</name>
</author>
<author><name sortKey="Clifton, Da" uniqKey="Clifton D">DA Clifton</name>
</author>
<author><name sortKey="Hopkins, Kl" uniqKey="Hopkins K">KL Hopkins</name>
</author>
<author><name sortKey="Woodford, N" uniqKey="Woodford N">N Woodford</name>
</author>
<author><name sortKey="Smith, Eg" uniqKey="Smith E">EG Smith</name>
</author>
<author><name sortKey="Ismail, N" uniqKey="Ismail N">N Ismail</name>
</author>
<author><name sortKey="Llewelyn, Mj" uniqKey="Llewelyn M">MJ Llewelyn</name>
</author>
<author><name sortKey="Peto, Te" uniqKey="Peto T">TE Peto</name>
</author>
<author><name sortKey="Crook, D" uniqKey="Crook D">D Crook</name>
</author>
<author><name sortKey="Mcvean, G" uniqKey="Mcvean G">G McVean</name>
</author>
<author><name sortKey="Walker, As" uniqKey="Walker A">AS Walker</name>
</author>
<author><name sortKey="Wilson, Dj" uniqKey="Wilson D">DJ Wilson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Drouin, A" uniqKey="Drouin A">A Drouin</name>
</author>
<author><name sortKey="Giguere, S" uniqKey="Giguere S">S Giguère</name>
</author>
<author><name sortKey="Deraspe, M" uniqKey="Deraspe M">M Déraspe</name>
</author>
<author><name sortKey="Marchand, M" uniqKey="Marchand M">M Marchand</name>
</author>
<author><name sortKey="Tyers, M" uniqKey="Tyers M">M Tyers</name>
</author>
<author><name sortKey="Loo, Vg" uniqKey="Loo V">VG Loo</name>
</author>
<author><name sortKey="Bourgault, A M" uniqKey="Bourgault A">A-M Bourgault</name>
</author>
<author><name sortKey="Laviolette, F" uniqKey="Laviolette F">F Laviolette</name>
</author>
<author><name sortKey="Corbeil, J" uniqKey="Corbeil J">J Corbeil</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Davis, Jj" uniqKey="Davis J">JJ Davis</name>
</author>
<author><name sortKey="Boisvert, S" uniqKey="Boisvert S">S Boisvert</name>
</author>
<author><name sortKey="Brettin, T" uniqKey="Brettin T">T Brettin</name>
</author>
<author><name sortKey="Kenyon, Rw" uniqKey="Kenyon R">RW Kenyon</name>
</author>
<author><name sortKey="Mao, C" uniqKey="Mao C">C Mao</name>
</author>
<author><name sortKey="Olson, R" uniqKey="Olson R">R Olson</name>
</author>
<author><name sortKey="Overbeek, R" uniqKey="Overbeek R">R Overbeek</name>
</author>
<author><name sortKey="Santerre, J" uniqKey="Santerre J">J Santerre</name>
</author>
<author><name sortKey="Shukla, M" uniqKey="Shukla M">M Shukla</name>
</author>
<author><name sortKey="Wattam, Ar" uniqKey="Wattam A">AR Wattam</name>
</author>
<author><name sortKey="Will, R" uniqKey="Will R">R Will</name>
</author>
<author><name sortKey="Xia, F" uniqKey="Xia F">F Xia</name>
</author>
<author><name sortKey="Stevens, R" uniqKey="Stevens R">R Stevens</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Eyre, Dw" uniqKey="Eyre D">DW Eyre</name>
</author>
<author><name sortKey="De Silva, D" uniqKey="De Silva D">D De Silva</name>
</author>
<author><name sortKey="Cole, K" uniqKey="Cole K">K Cole</name>
</author>
<author><name sortKey="Peters, J" uniqKey="Peters J">J Peters</name>
</author>
<author><name sortKey="Cole, Mj" uniqKey="Cole M">MJ Cole</name>
</author>
<author><name sortKey="Grad, Yh" uniqKey="Grad Y">YH Grad</name>
</author>
<author><name sortKey="Demczuk, W" uniqKey="Demczuk W">W Demczuk</name>
</author>
<author><name sortKey="Martin, I" uniqKey="Martin I">I Martin</name>
</author>
<author><name sortKey="Mulvey, Mr" uniqKey="Mulvey M">MR Mulvey</name>
</author>
<author><name sortKey="Crook, D" uniqKey="Crook D">D Crook</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Meinshausen, N" uniqKey="Meinshausen N">N Meinshausen</name>
</author>
<author><name sortKey="Buhlmann, P" uniqKey="Buhlmann P">P Bühlmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Boisvert, Sebastien" uniqKey="Boisvert S">Sébastien Boisvert</name>
</author>
<author><name sortKey="Raymond, Frederic" uniqKey="Raymond F">Frédéric Raymond</name>
</author>
<author><name sortKey="Godzaridis, Elenie" uniqKey="Godzaridis E">Élénie Godzaridis</name>
</author>
<author><name sortKey="Laviolette, Francois" uniqKey="Laviolette F">François Laviolette</name>
</author>
<author><name sortKey="Corbeil, Jacques" uniqKey="Corbeil J">Jacques Corbeil</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Lim, C" uniqKey="Lim C">C Lim</name>
</author>
<author><name sortKey="Yu, B" uniqKey="Yu B">B Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
<author><name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author><name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author><name sortKey="Phillippy, A" uniqKey="Phillippy A">A Phillippy</name>
</author>
<author><name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author><name sortKey="Smoot, M" uniqKey="Smoot M">M Smoot</name>
</author>
<author><name sortKey="Shumway, M" uniqKey="Shumway M">M Shumway</name>
</author>
<author><name sortKey="Antonescu, C" uniqKey="Antonescu C">C Antonescu</name>
</author>
<author><name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author><name sortKey="Limasset, A" uniqKey="Limasset A">A Limasset</name>
</author>
<author><name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author><name sortKey="Gao, X" uniqKey="Gao X">X Gao</name>
</author>
<author><name sortKey="Luo, T" uniqKey="Luo T">T Luo</name>
</author>
<author><name sortKey="Wu, J" uniqKey="Wu J">J Wu</name>
</author>
<author><name sortKey="Sun, G" uniqKey="Sun G">G Sun</name>
</author>
<author><name sortKey="Liu, Q" uniqKey="Liu Q">Q Liu</name>
</author>
<author><name sortKey="Jiang, Y" uniqKey="Jiang Y">Y Jiang</name>
</author>
<author><name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author><name sortKey="Mei, J" uniqKey="Mei J">J Mei</name>
</author>
<author><name sortKey="Gao, Q" uniqKey="Gao Q">Q Gao</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jnawali, Hn" uniqKey="Jnawali H">HN Jnawali</name>
</author>
<author><name sortKey="Ryoo, S" uniqKey="Ryoo S">S Ryoo</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Kim, S" uniqKey="Kim S">S Kim</name>
</author>
<author><name sortKey="Xing, Ep" uniqKey="Xing E">EP Xing</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vervier, K" uniqKey="Vervier K">K Vervier</name>
</author>
<author><name sortKey="Mahe, P" uniqKey="Mahe P">P Mahé</name>
</author>
<author><name sortKey="D Spremont, A" uniqKey="D Spremont A">A D’Aspremont</name>
</author>
<author><name sortKey="Veyrieras, J B" uniqKey="Veyrieras J">J-B Veyrieras</name>
</author>
<author><name sortKey="Vert, J P" uniqKey="Vert J">J-P Vert</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Mccullagh, P" uniqKey="Mccullagh P">P McCullagh</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dundar, M" uniqKey="Dundar M">M Dundar</name>
</author>
<author><name sortKey="Krishnapuram, B" uniqKey="Krishnapuram B">B Krishnapuram</name>
</author>
<author><name sortKey="Bi, J" uniqKey="Bi J">J Bi</name>
</author>
<author><name sortKey="Rao, Rb" uniqKey="Rao R">RB Rao</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Votintseva, Antonina A" uniqKey="Votintseva A">Antonina A. Votintseva</name>
</author>
<author><name sortKey="Bradley, Phelim" uniqKey="Bradley P">Phelim Bradley</name>
</author>
<author><name sortKey="Pankhurst, Louise" uniqKey="Pankhurst L">Louise Pankhurst</name>
</author>
<author><name sortKey="Del Ojo Elias, Carlos" uniqKey="Del Ojo Elias C">Carlos del Ojo Elias</name>
</author>
<author><name sortKey="Loose, Matthew" uniqKey="Loose M">Matthew Loose</name>
</author>
<author><name sortKey="Nilgiriwala, Kayzad" uniqKey="Nilgiriwala K">Kayzad Nilgiriwala</name>
</author>
<author><name sortKey="Chatterjee, Anirvan" uniqKey="Chatterjee A">Anirvan Chatterjee</name>
</author>
<author><name sortKey="Smith, E Grace" uniqKey="Smith E">E. Grace Smith</name>
</author>
<author><name sortKey="Sanderson, Nicolas" uniqKey="Sanderson N">Nicolas Sanderson</name>
</author>
<author><name sortKey="Walker, Timothy M" uniqKey="Walker T">Timothy M. Walker</name>
</author>
<author><name sortKey="Morgan, Marcus R" uniqKey="Morgan M">Marcus R. Morgan</name>
</author>
<author><name sortKey="Wyllie, David H" uniqKey="Wyllie D">David H. Wyllie</name>
</author>
<author><name sortKey="Walker, A Sarah" uniqKey="Walker A">A. Sarah Walker</name>
</author>
<author><name sortKey="Peto, Tim E A" uniqKey="Peto T">Tim E. A. Peto</name>
</author>
<author><name sortKey="Crook, Derrick W" uniqKey="Crook D">Derrick W. Crook</name>
</author>
<author><name sortKey="Iqbal, Zamin" uniqKey="Iqbal Z">Zamin Iqbal</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">30332990</article-id>
<article-id pub-id-type="pmc">6192184</article-id>
<article-id pub-id-type="publisher-id">2403</article-id>
<article-id pub-id-type="doi">10.1186/s12859-018-2403-z</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Predicting bacterial resistance from whole-genome sequences using <italic>k</italic>-mers and stability selection</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-3173-6614</contrib-id>
<name><surname>Mahé</surname>
<given-names>Pierre</given-names>
</name>
<address><email>pierre.mahe@biomerieux.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Tournoud</surname>
<given-names>Maud</given-names>
</name>
<address><email>maud.tournoud@biomerieux.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1"><institution-wrap><institution-id institution-id-type="ISNI">0000 0004 0387 6489</institution-id>
<institution-id institution-id-type="GRID">grid.424167.2</institution-id>
<institution>bioMérieux,</institution>
</institution-wrap>
Chemin de l’Orme, Marcy l’Etoile, 69280 France</aff>
</contrib-group>
<pub-date pub-type="epub"><day>17</day>
<month>10</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>17</day>
<month>10</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection"><year>2018</year>
</pub-date>
<volume>19</volume>
<elocation-id>383</elocation-id>
<history><date date-type="received"><day>13</day>
<month>2</month>
<year>2018</year>
</date>
<date date-type="accepted"><day>1</day>
<month>10</month>
<year>2018</year>
</date>
</history>
<permissions><copyright-statement>© The Author(s) 2018</copyright-statement>
<license license-type="OpenAccess"><license-p><bold>Open Access</bold>
 This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1"><sec><title>Background</title>
<p>Several studies demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences, the prediction process usually amounting to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations, previously identified from a training panel of strains, within these genes. We address the problem from the supervised statistical learning perspective, not relying on prior information about such resistance factors. We rely on a <italic>k</italic>
-mer based genotyping scheme and a logistic regression model, thereby combining several <italic>k</italic>
-mers into a probabilistic model. To identify a small yet predictive set of <italic>k</italic>
-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–73, 2010), that consists in penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures.</p>
</sec>
<sec><title>Results</title>
<p>Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to state of the art. The models are extremely sparse, involving 1 to 8 <italic>k</italic>
-mers per antibiotic, hence are remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies).</p>
</sec>
<sec><title>Conclusion</title>
<p>Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (10.1186/s12859-018-2403-z) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en"><title>Keywords</title>
<kwd>Genotype-phenotype</kwd>
<kwd>Feature selection</kwd>
<kwd><italic>k</italic>
-mers</kwd>
<kwd>Lasso</kwd>
</kwd-group>
<custom-meta-group><custom-meta><meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec id="Sec1"><title>Background</title>
<p>Recent advances in Next-Generation Sequencing (NGS) technologies provided new tools to sequence large amounts of DNA at a reasonable cost and in a limited period [<xref ref-type="bibr" rid="CR1">1</xref>
]. This technological breakthrough is expected to significantly modify the landscape and practices in the field of clinical microbiology. Microorganisms can now be characterized with unprecedented resolution, which can have a significant impact for both research and diagnostics purposes (see e.g., [<xref ref-type="bibr" rid="CR2">2</xref>
, <xref ref-type="bibr" rid="CR3">3</xref>
]). In terms of diagnostics, NGS indeed holds the promise of addressing, in a single experiment, the main questions of clinical interest: identifying an isolate and determining its antibiotic resistance and virulence profile [<xref ref-type="bibr" rid="CR4">4</xref>
]. The genetic bases of antibiotic resistance and virulence remain partly unknown for most bacterial species, and it is still an open question whether the resistance or virulence of a microorganism can be inferred from its genome alone. A recent study showed, for example, that an isogenic bacterial population exhibited heterogeneity in drug susceptibility due to random partitioning of efflux pumps during cellular division [<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
<p>Nevertheless, several studies have demonstrated the feasibility of genotypic approaches for the detection of antibiotic resistance: a good concordance has been observed between resistance phenotypes predicted from microorganisms' genomes and their reference phenotypes, determined experimentally by assessing their ability to grow in the presence of antibiotics. The genetic bases of antibiotic resistance are for instance well known for <italic>Staphylococcus aureus</italic>
 and <italic>Mycobacterium tuberculosis</italic>
, and accurate predictions could be achieved for these species by simply detecting specific genetic resistance determinants [<xref ref-type="bibr" rid="CR6">6</xref>
–<xref ref-type="bibr" rid="CR11">11</xref>
]. This strategy is notably implemented in well-established tools like Mykrobe [<xref ref-type="bibr" rid="CR7">7</xref>
] and TBProfiler [<xref ref-type="bibr" rid="CR9">9</xref>
], which leverage for instance catalogs of more than 200 and 1300 mutations, respectively, to predict resistance of <italic>M. tuberculosis</italic>
 to 9 and 10 antibiotics or antibiotic families. While this approach proved to be effective for these relatively clonal species, it may suffer from two limitations if transposed to other bacterial species and/or drugs. The first limitation is that it intrinsically relies on prior knowledge of resistance determinants, which may not be available for all species. Non-coding regions are not explored either, although exploring them could be beneficial even for <italic>M. tuberculosis</italic>
, for which resistance is mostly due to the presence and accumulation of mutations and indels within a limited number of core genes [<xref ref-type="bibr" rid="CR12">12</xref>
–<xref ref-type="bibr" rid="CR14">14</xref>
]. Secondly, resistance prediction rules typically rely on the presence of at least one resistance determinant, whereas it may be beneficial to combine several of them in a common prediction model to address complex multi-factorial resistance mechanisms, or to model the accumulation of several mutations eventually leading to resistance [<xref ref-type="bibr" rid="CR12">12</xref>
, <xref ref-type="bibr" rid="CR15">15</xref>
]. With these two limitations in mind, we address the problem from the supervised learning perspective and build multi-factorial prediction rules from a panel of strains, for which whole-genome sequences and resistance phenotypes are available, without relying on any prior knowledge about the resistance determinants. We adopt a systematic <italic>k</italic>
-mer based strain genotyping scheme, where any possible <italic>k</italic>
-mer is a candidate determinant, and rely on the logistic regression model to combine several <italic>k</italic>
-mers into a probabilistic prediction rule.</p>
<p>Genotyping strains with <italic>k</italic>
-mers, where every sequence of length <italic>k</italic>
 found in a genome is a putative resistance determinant, removes the need to know in advance which genes are involved in antibiotic resistance. This approach offers the additional benefits of being alignment-free and able to capture various types of genetic determinants, like the presence of genes, as well as Single-Nucleotide Polymorphisms (SNPs) and indels that can be located in coding or non-coding regions [<xref ref-type="bibr" rid="CR16">16</xref>
]. Such <italic>k</italic>
-mer based representations are therefore increasingly popular in this context, for both genome-wide association studies [<xref ref-type="bibr" rid="CR16">16</xref>
, <xref ref-type="bibr" rid="CR17">17</xref>
] and predictive modelling [<xref ref-type="bibr" rid="CR18">18</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
]. The probabilistic framework offered by the logistic regression model is also appealing. First, it naturally combines several genomic determinants in a global predictive model with weights modulating their respective effects, hence quantifying their relative predictive power and reflecting the fact that they can be associated with different levels of resistance [<xref ref-type="bibr" rid="CR12">12</xref>
]. Second, it provides a probabilistic prediction, which quantifies the confidence the user can have in the results.</p>
<p>Our approach is closely related to [<xref ref-type="bibr" rid="CR18">18</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
], who recently relied on machine learning approaches to predict categorical antibiotic resistance phenotypes from <italic>k</italic>
-mers. The resulting prediction rules are based on the detection of possibly multiple <italic>k</italic>
-mers, which are automatically selected by the algorithm and are respectively combined by means of logical operations (conjunctions or disjunctions) or a linear combination. Likewise, [<xref ref-type="bibr" rid="CR20">20</xref>
] recently relied on a standard linear regression model to predict the Minimum Inhibitory Concentration (MIC) from candidate mutations found in pre-defined genes, and [<xref ref-type="bibr" rid="CR11">11</xref>
] explored several machine learning strategies to predict <italic>M. tuberculosis</italic>
 resistance from a pre-defined list of SNPs, with promising results. By combining <italic>k</italic>
-mers and logistic regression, we therefore aim to bridge the gap between these two approaches, and thus to build prediction models that combine several genomic determinants within a versatile probabilistic framework, without relying on prior knowledge of the underlying resistance mechanisms.</p>
<p>For the sake of interpretability and computational efficiency of the prediction, we have the utmost interest in building concise models, involving as few genetic determinants as possible. From the statistical learning perspective, the challenge is to identify a small yet predictive set of <italic>k</italic>
-mers from a very large number of redundant and correlated ones (several hundred thousand). We rely for this purpose on the so-called stability selection approach [<xref ref-type="bibr" rid="CR21">21</xref>
], that consists in penalizing the logistic regression model with the sparsity-promoting Lasso penalty, within an extensive resampling procedure. We present a proof of concept of this approach for <italic>M. tuberculosis</italic>
 and <italic>S. aureus</italic>
 using existing datasets [<xref ref-type="bibr" rid="CR7">7</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
]. Our main contribution is to demonstrate that stability selection is a very efficient strategy in this context, leading to robust and extremely sparse signatures of resistance. The empirical results furthermore make it possible to differentiate between the association and prediction perspectives on antibiotic resistance in bacteria, and suggest several promising leads for further work.</p>
</sec>
<sec id="Sec2"><title>Methods</title>
<sec id="Sec3"><title>Strain genotyping with <italic>k</italic>
-mers</title>
<p>To genotype a training panel of <italic>n</italic>
 assembled bacterial genomes using <italic>k</italic>
-mers, we first build a large matrix encoding the presence or absence of all <italic>k</italic>
-mers of length <italic>k</italic>
=31 within each strain, using the Ray software [<xref ref-type="bibr" rid="CR22">22</xref>
]. Considering <italic>k</italic>
-mers of length 31 is a safe default choice in this context, offering in general a good trade-off between sequence specificity and computational efficiency [<xref ref-type="bibr" rid="CR18">18</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
]. Since <italic>k</italic>
-mers occurring in too few or too many strains have a limited predictive power, we then discard <italic>k</italic>
-mers found in one or two strains only, or in all the strains but one or two. Finally, since this <italic>k</italic>
-mer based representation is highly redundant [<xref ref-type="bibr" rid="CR18">18</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
], we define sets of equivalent <italic>k</italic>
-mers from <italic>k</italic>
-mers having the same presence/absence profile in all the strains. For the sake of model building, we need to keep only one of them, which in practice dramatically reduces the number of <italic>k</italic>
-mers to consider. We keep track of these sets of equivalent <italic>k</italic>
-mers in order to better interpret the model obtained and carry out the predictions, as discussed in “<xref rid="Sec5" ref-type="sec">Prediction from genome assembly or sequencing reads</xref>
” and “<xref rid="Sec6" ref-type="sec">Model interpretation</xref>
” sections. Equivalent <italic>k</italic>
-mers typically correspond to contiguous stretches of sequences conserved across strains, but can also correspond to non-contiguous stretches in total Linkage Disequilibrium (LD).</p>
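<p>As a rough illustration of this genotyping step, the following Python sketch builds presence/absence profiles for toy sequences, applies the frequency filter and groups equivalent <italic>k</italic>-mers. It is not the Ray-based pipeline used in this study: the function names (<monospace>kmers_of</monospace>, <monospace>genotype</monospace>) and the <monospace>min_count</monospace> parameter are hypothetical, reverse complements are ignored, and <italic>k</italic>-mers present in every strain are dropped as well, which is an assumption.</p>
<code language="python"><![CDATA[
```python
from collections import defaultdict


def kmers_of(sequence, k):
    """Return the set of distinct k-mers found in one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def genotype(genomes, k=31, min_count=3):
    """Group k-mers by presence/absence profile across strains.

    genomes: dict mapping a strain name to its assembled sequence (toy input).
    Returns a dict mapping each retained profile (tuple of 0/1, one entry per
    strain in sorted name order) to the sorted list of equivalent k-mers.
    """
    strains = sorted(genomes)
    n = len(strains)
    presence = defaultdict(set)  # k-mer -> indices of the strains containing it
    for idx, name in enumerate(strains):
        for kmer in kmers_of(genomes[name], k):
            presence[kmer].add(idx)

    groups = defaultdict(list)
    for kmer, carriers in presence.items():
        # Drop k-mers seen in too few strains (one or two with min_count=3)
        # or in too many (all strains but one or two, or all of them).
        if len(carriers) < min_count or len(carriers) > n - min_count:
            continue
        profile = tuple(1 if i in carriers else 0 for i in range(n))
        groups[profile].append(kmer)
    return {profile: sorted(kmers) for profile, kmers in groups.items()}
```
]]></code>
<p>Each key of the returned dictionary is one non-redundant candidate variable, and the associated list is its set of equivalent <italic>k</italic>-mers; keeping any single member of the list is enough for model building.</p>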
</sec>
<sec id="Sec4"><title><italic>k</italic>
-mer selection using <italic>L</italic>
<sub>1</sub>
 penalty and stability selection</title>
<p>Logistic regression is a widely used generalized linear model addressing binary classification problems. In our case, it consists in building a linear function defined for a strain represented by a vector <italic>x</italic>
∈{0,1}<sup><italic>p</italic>
</sup> as: 
<disp-formula id="Equ1"><label>1</label>
<alternatives><tex-math id="M1">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document} $$ f(x) = \beta_{0} + \sum\limits_{j=1}^{p} \beta_{j} x_{j},  $$ \end{document}</tex-math>
<mml:math id="M2"><mml:mi>f</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub><mml:mrow><mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow><mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:munderover accent="false" accentunder="false"><mml:mrow><mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub><mml:mrow><mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub><mml:mrow><mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2018_2403_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where <italic>p</italic>
 corresponds to the number of non-redundant <italic>k</italic>
-mers obtained by the previous process, and <italic>x</italic>
 encodes the presence/absence of these <italic>k</italic>
-mers. To estimate the model coefficients and simultaneously select a limited number of <italic>k</italic>
-mers from a training panel of <italic>n</italic>
 strains, one can rely on the <italic>L</italic>
<sub>1</sub> or Lasso penalty and consider the following optimization problem: 
<disp-formula id="Equ2"><label>2</label>
<alternatives><tex-math id="M3">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document} $$ \hat{\beta} = \text{arg} \min_{\beta \in \mathbb{R}^{p+1}} \sum\limits_{i=1}^{n} L\left(y_{i}, f\left(x^{(i)}\right)\right) + \lambda \sum_{j=1}^{p} |\beta_{j}|,  $$ \end{document}</tex-math>
<mml:math id="M4"><mml:mover accent="true"><mml:mrow><mml:mi>β</mml:mi>
</mml:mrow>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:munder><mml:mrow><mml:mo>min</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>β</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup><mml:mrow><mml:mi>ℝ</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>p</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
<mml:munderover accent="false" accentunder="false"><mml:mrow><mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>L</mml:mi>
<mml:mfenced close=")" open="(" separators=""><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced close=")" open="(" separators=""><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow><mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:mi>λ</mml:mi>
<mml:munderover><mml:mrow><mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mi>p</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo>|</mml:mo>
<mml:msub><mml:mrow><mml:mi>β</mml:mi>
</mml:mrow>
<mml:mrow><mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2018_2403_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where <italic>y</italic>
<sub><italic>i</italic>
</sub>
=1 if the <italic>i</italic>
th strain, represented by <italic>x</italic>
<sup>(<italic>i</italic>
)</sup>
, is resistant and 0 otherwise. The function <italic>L</italic>
 is the logistic loss function, which quantifies the discrepancy between the true phenotypes <italic>y</italic>
<sub><italic>i</italic>
</sub>
 of the strains and the predictions <italic>f</italic>
(<italic>x</italic>
<sup>(<italic>i</italic>
)</sup>
) obtained by the model. The <italic>λ</italic>
 parameter achieves a trade-off between this empirical error and the Lasso regularization term, and is usually optimized by cross-validation.</p>
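<p>To make the objective of Eq. 2 concrete, the sketch below evaluates it for given coefficients. The text does not spell out the form of <italic>L</italic>; we assume here the usual logistic loss for labels in {0,1}, namely log(1+exp(<italic>f</italic>)) minus <italic>y</italic> times <italic>f</italic>. The function names are hypothetical and the code only evaluates the criterion, it does not minimize it.</p>
<code language="python"><![CDATA[
```python
import math


def linear_score(beta0, beta, x):
    """f(x) = beta0 + sum_j beta_j * x_j for a 0/1 presence vector x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def logistic_loss(y, score):
    """Negative log-likelihood of a label y in {0, 1} given the linear score."""
    # log(1 + exp(score)) - y * score, written to avoid overflow for large |score|.
    softplus = max(score, 0.0) + math.log1p(math.exp(-abs(score)))
    return softplus - y * score


def lasso_objective(beta0, beta, X, y, lam):
    """Sum of logistic losses plus lam times the L1 norm of beta.

    The intercept beta0 is not penalized, as in Eq. 2.
    """
    data_term = sum(
        logistic_loss(yi, linear_score(beta0, beta, xi)) for xi, yi in zip(X, y)
    )
    penalty = lam * sum(abs(b) for b in beta)
    return data_term + penalty
```
]]></code>
<p>Increasing <monospace>lam</monospace> raises the cost of every non-zero coefficient, which is what drives most coefficients to exactly zero at the optimum.</p>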
<p>Effectively tuning the Lasso regularization parameter is, however, a challenging problem. Cross-validation usually succeeds in building models with good predictive power, but the set of variables with non-zero coefficients is known to be unstable with respect to small variations in the training dataset or in the value of the regularization parameter.</p>
<p>Several resampling based strategies have been proposed to cope with this issue [<xref ref-type="bibr" rid="CR21">21</xref>
, <xref ref-type="bibr" rid="CR23">23</xref>
, <xref ref-type="bibr" rid="CR24">24</xref>
]. We rely on the stability selection approach of [<xref ref-type="bibr" rid="CR21">21</xref>
], which is illustrated in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1. It consists in repeatedly subsampling the entire dataset (step 1), solving the Lasso-penalized logistic regression for each subsample (step 2), and computing for each <italic>k</italic>
-mer the proportion of models in which it was selected. This process is repeated for several values of the regularization parameter, which defines “stability paths” quantifying the probability of selecting each <italic>k</italic>
-mer along the grid of regularization values (step 3). Instead of optimizing the regularization parameter <italic>λ</italic>
 itself, [<xref ref-type="bibr" rid="CR21">21</xref>
] propose to consider a threshold on the probability of being selected at some point on the regularization path: every <italic>k</italic>
-mer whose stability path exceeds this threshold gets ultimately selected. Having identified stable <italic>k</italic>
-mers, we finally fit a standard un-penalized logistic regression model that can be used to make predictions on new genome sequences (step 4). We note, however, that this approach recasts the problem of optimizing the Lasso regularization parameter as that of optimizing the stability threshold, which can be done by cross-validation. To do so, we repeat the whole process described in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S1 within each cross-validation fold, as detailed in “<xref rid="Sec10" ref-type="sec">Models obtained</xref>
” section. Using this optimized probability threshold, we repeat the process on the entire dataset to build the final model. In practice, we relied on the R software, and more precisely on the glmnet package [<xref ref-type="bibr" rid="CR25">25</xref>
], to compute regularization paths. Moreover, we considered subsamples involving half of the dataset and normalized each <italic>k</italic>
-mer by its <italic>L</italic>
<sub>2</sub>
 norm to ensure a homogeneous level of penalization, as suggested and discussed in [<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
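<p>Steps 1 to 3 of this procedure can be sketched in a few lines of Python. This is a toy stand-in, not the glmnet-based implementation used here: the Lasso fit is a plain proximal-gradient loop with a fixed step that is only adequate for small problems, the loss is averaged rather than summed (so the scale of the regularization parameter differs from Eq. 2), the <italic>L</italic><sub>2</sub> normalization and the final un-penalized refit (step 4) are omitted, and all function and parameter names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math
import random


def fit_l1_logistic(X, y, lam, n_iter=200, step=0.5):
    """Proximal-gradient (ISTA) fit of an L1-penalized logistic regression.

    Minimizes the mean logistic loss plus lam * ||beta||_1; the intercept is
    not penalized. Returns (beta0, beta).
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            score = beta0 + sum(b * v for b, v in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-score)) - yi
            grad0 += resid / n
            for j, v in enumerate(xi):
                if v:
                    grad[j] += resid * v / n
        beta0 -= step * grad0
        for j in range(p):
            z = beta[j] - step * grad[j]
            # Soft-thresholding: the proximal operator of the L1 penalty.
            beta[j] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return beta0, beta


def stability_selection(X, y, lambdas, threshold=0.6, n_subsamples=100, seed=0):
    """Return (stable feature indices, maximum selection frequency per feature)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [[0] * p for _ in lambdas]
    for _ in range(n_subsamples):
        rows = rng.sample(range(n), n // 2)  # step 1: subsample half of the strains
        Xs, ys = [X[i] for i in rows], [y[i] for i in rows]
        for l, lam in enumerate(lambdas):    # step 2: one Lasso fit per lambda
            _, beta = fit_l1_logistic(Xs, ys, lam)
            for j, b in enumerate(beta):
                if b != 0.0:
                    counts[l][j] += 1
    # Step 3: stability paths, i.e. selection frequency per lambda and feature.
    paths = [[c / n_subsamples for c in row] for row in counts]
    max_freq = [max(path[j] for path in paths) for j in range(p)]
    # A feature is stable if its path reaches the threshold for some lambda.
    stable = [j for j in range(p) if max_freq[j] >= threshold]
    return stable, max_freq
```
]]></code>
<p>The indices returned in <monospace>stable</monospace> would then be the only variables passed to the final un-penalized logistic regression.</p>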
</sec>
<sec id="Sec5"><title>Prediction from genome assembly or sequencing reads</title>
<p>Given sequencing data obtained from a new strain, the prediction process amounts to detecting the <italic>k</italic>
-mers selected beforehand and making a prediction based on the score provided by the final logistic regression model, which can be turned into a probability of being resistant by the logistic function. We do not tolerate any mismatch to call a <italic>k</italic>
-mer present and rely on the nucmer utility of the MUMmer package [<xref ref-type="bibr" rid="CR26">26</xref>
] to do so. Predictions can be obtained from an assembled genome or directly from sequencing reads. In the latter case, however, a minimum threshold related to the sequencing depth must be considered to call a <italic>k</italic>
-mer present from its number of occurrences, in order to be robust to sequencing errors.</p>
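<p>The two ingredients of this prediction step, the logistic function and the occurrence threshold applied to reads, can be sketched as follows. The function names are hypothetical, and the coefficients and minimum number of occurrences used in any call are illustrative values rather than values from this study.</p>
<code language="python"><![CDATA[
```python
import math


def resistance_probability(beta0, beta, calls):
    """Turn the linear score of the model into a probability of resistance."""
    score = beta0 + sum(b * c for b, c in zip(beta, calls))
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


def call_from_reads(occurrences, min_occurrences):
    """Call a k-mer present only if it is seen at least min_occurrences times.

    The threshold guards against k-mers created by sequencing errors and
    should grow with the sequencing depth.
    """
    return 1 if occurrences >= min_occurrences else 0
```
]]></code>
<p>For an assembled genome the calls are plain 0/1 exact-match indicators, so only <monospace>resistance_probability</monospace> is needed.</p>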
<p>Importantly, to decide whether each <italic>k</italic>
-mer involved in a model is present, we rely on its entire set of equivalent <italic>k</italic>
-mers. We detect each of them, and consider several strategies to ultimately call the <italic>k</italic>
-mer present. The <italic>stringent</italic>
 strategy calls a <italic>k</italic>
-mer present only if all its equivalent <italic>k</italic>
-mers are detected. Conversely, the <italic>conservative</italic>
 strategy calls it present as soon as one of them is detected. Between these two possibilities, the <italic>vote</italic>
 strategy calls a <italic>k</italic>
-mer present when more than half of its equivalent <italic>k</italic>
-mers are detected, and the <italic>smooth</italic>
 strategy uses the proportion of equivalent <italic>k</italic>
-mers detected instead of a binary presence/absence call. For instance, if we detected 8 <italic>k</italic>
-mers out of a set of 10 equivalent ones, both the <italic>conservative</italic>
 and <italic>vote</italic>
 strategies would call the <italic>k</italic>
-mer present, while the <italic>stringent</italic>
 would not. The <italic>smooth</italic>
 strategy would use a value of 0.8 – instead of 1 or 0 for presence or absence respectively – thereby effectively modulating the weight given to this <italic>k</italic>
-mer by the logistic regression model, in order to account for the uncertainty in its detection. The optimal strategy to consider may in particular depend on the genomic plasticity of the bacterial species under study and the extent to which the training panel of strains properly accounts for it. Provided that the reads of the training data are available, it could easily be optimized by cross-validation techniques: the impact of the various strategies on the predictive performance could indeed be empirically measured and the best strategy retained. In this study, we solely assess the impact of the various strategies on the predictive performance, measured on an independent validation sample, and leave the task of optimizing it automatically from the training data for future work. This prediction process takes a few seconds for an assembled genome and typically a couple of minutes for reads, depending on the sequencing depth (see “<xref rid="Sec11" ref-type="sec">Predictive performance</xref>
” section).</p>
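<p>The four calling strategies reduce to a small function of the number of equivalent <italic>k</italic>-mers detected. The sketch below is one possible encoding; the function name and signature are hypothetical.</p>
<code language="python"><![CDATA[
```python
def call_kmer(n_detected, n_equivalent, strategy):
    """Value given to a model k-mer from the detection of its equivalent k-mers.

    n_detected: number of equivalent k-mers found in the new strain.
    n_equivalent: size of the set of equivalent k-mers (must be positive).
    """
    if n_equivalent <= 0:
        raise ValueError("the set of equivalent k-mers must not be empty")
    fraction = n_detected / n_equivalent
    if strategy == "stringent":      # all equivalent k-mers must be detected
        return 1.0 if n_detected == n_equivalent else 0.0
    if strategy == "conservative":   # a single detected k-mer is enough
        return 1.0 if n_detected >= 1 else 0.0
    if strategy == "vote":           # more than half must be detected
        return 1.0 if fraction > 0.5 else 0.0
    if strategy == "smooth":         # proportion instead of a binary call
        return fraction
    raise ValueError("unknown strategy: " + strategy)
```
]]></code>
<p>With 8 detected <italic>k</italic>-mers out of 10, this returns 1.0 for the conservative and vote strategies, 0.0 for the stringent one and 0.8 for the smooth one, matching the example given above.</p>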
</sec>
<sec id="Sec6"><title>Model interpretation</title>
<p>We aim to annotate the <italic>k</italic>
-mers included in the model, to identify whether they fall within known genes or regulatory regions, and provide their putative function when possible. We consider the set of equivalent <italic>k</italic>
-mers associated with each <italic>k</italic>
-mer included in the model and try to reconstruct the longer stretch(es) of sequence(s) that they originate from. A set of equivalent <italic>k</italic>
-mers usually corresponds to a larger sequence perfectly conserved across several strains of the panel, and sometimes to several such sequences in total LD. Following the terminology used by genome assembly algorithms, we assemble equivalent <italic>k</italic>
-mers into longer <italic>unitigs</italic>
, defined as the longest sequence(s) that can be obtained when they overlap by exactly <italic>k</italic>
−1 nucleotides, using the bcalm2 software [<xref ref-type="bibr" rid="CR27">27</xref>
]. A unitig is a convenient representation of a set of equivalent <italic>k</italic>
-mers because it has the same presence/absence profile on the training genomes, but is in general longer than the individual equivalent <italic>k</italic>
-mers, hence easier to annotate. Unitigs are finally aligned against one or several annotated reference genome(s) using the blastn-short program with a minimum of 80% identity and 85% coverage as filtering parameters, which, although not deeply optimized, gave satisfactory results in our experiments.</p>
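<p>The way equivalent <italic>k</italic>-mers are merged into unitigs can be illustrated with a simplified Python sketch. It is not bcalm2: only the forward strand is considered (no reverse complements), everything is held in memory, and the function name is hypothetical. A <italic>k</italic>-mer is appended to the current unitig only when it is the unique successor of the previous one and has no other predecessor.</p>
<code language="python"><![CDATA[
```python
def build_unitigs(kmers):
    """Merge k-mers overlapping by exactly k-1 nucleotides into unitigs."""
    kmers = set(kmers)
    successors = {
        km: [km[1:] + c for c in "ACGT" if km[1:] + c in kmers] for km in kmers
    }
    predecessors = {
        km: [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmers] for km in kmers
    }
    used = set()

    def extends_left(km):
        # True when km has a single predecessor whose only successor is km,
        # i.e. km cannot be the first k-mer of a unitig.
        preds = predecessors[km]
        return len(preds) == 1 and len(successors[preds[0]]) == 1

    def walk(start):
        # Extend to the right while the path is unambiguous in both directions.
        path, current = start, start
        used.add(start)
        while len(successors[current]) == 1:
            nxt = successors[current][0]
            if nxt in used or len(predecessors[nxt]) != 1:
                break
            path += nxt[-1]
            used.add(nxt)
            current = nxt
        return path

    unitigs = [walk(km) for km in sorted(kmers) if not extends_left(km)]
    # Any k-mer still unused lies on an isolated cycle; open it arbitrarily.
    unitigs += [walk(km) for km in sorted(kmers) if km not in used]
    return sorted(unitigs)
```
]]></code>
<p>A set of equivalent <italic>k</italic>-mers drawn from one conserved region yields a single unitig, whereas a set covering several regions in total LD yields one unitig per region.</p>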
</sec>
</sec>
<sec id="Sec7"><title>Results and discussion</title>
<p>We now present a proof of concept of the method described in the previous section on two bacterial species, <italic>M. tuberculosis</italic>
 and <italic>S. aureus</italic>
.</p>
<sec id="Sec8"><title><italic>M. tuberculosis</italic>
 study</title>
<sec id="Sec9"><title>Dataset constitution</title>
<p>We gathered two datasets from previous studies. The training dataset was taken from [<xref ref-type="bibr" rid="CR19">19</xref>
], who recently made available a set of 1306 assembled <italic>M. tuberculosis</italic>
 genomes, together with binary resistance phenotypes relating to 7 drugs, namely ethambutol, ethionamide, isoniazid, kanamycin, ofloxacin, rifampicin, and streptomycin. The resistance phenotype of each strain was not always available for each drug, but for each drug the number of resistant and susceptible strains was reasonably high (Table <xref rid="Tab1" ref-type="table">1</xref>
). The validation dataset involved 1586 strains<xref ref-type="fn" rid="Fn1">1</xref>
 that were previously used to validate the performance of the Mykrobe software [<xref ref-type="bibr" rid="CR7">7</xref>
]. Binary resistance phenotypes were available for 5 out of the 7 antibiotics included in the training panel, but not for ethionamide and ofloxacin. We used moxifloxacin as a proxy for ofloxacin resistance, since these two drugs belong to the same family and exhibit a very high level of cross-resistance [<xref ref-type="bibr" rid="CR28">28</xref>
]. We note that the number of available phenotypes varied across antibiotics, with in particular only nine resistant strains each for kanamycin and ofloxacin (Table <xref rid="Tab2" ref-type="table">2</xref>). This made estimating the sensitivity of the predictive models for these drugs uncertain. We also note that we worked directly from raw sequencing data to obtain predictions, that is, without relying on a prior step of genome assembly.
<table-wrap id="Tab2"><label>Table 2</label>
<caption><p><italic>M. tuberculosis</italic>
 study: validation results</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left"></th>
<th align="left" colspan="2">Stability</th>
<th align="left" colspan="2">Mykrobe</th>
</tr>
<tr><th align="left"></th>
<th align="left">R / S</th>
<th align="left">sensi.</th>
<th align="left">speci.</th>
<th align="left">sensi.</th>
<th align="left">speci.</th>
</tr>
</thead>
<tbody><tr><td align="left">ethambutol</td>
<td align="left">194 / 1391</td>
<td align="left">60.3 (6.9)</td>
<td align="left">97.5 (0.8)</td>
<td align="left">71.6 (6.3)</td>
<td align="left">95.8 (1.1)</td>
</tr>
<tr><td align="left">isoniazid</td>
<td align="left">370 / 1216</td>
<td align="left">89.7 (3.1)</td>
<td align="left">97.5 (0.9)</td>
<td align="left">84.3 (3.7)</td>
<td align="left">98.6 (0.7)</td>
</tr>
<tr><td align="left">kanamycin</td>
<td align="left">9 / 460</td>
<td align="left">33.3 (30.8)</td>
<td align="left">98.9 (0.9)</td>
<td align="left">33.3 (30.8)</td>
<td align="left">99.6 (0.6)</td>
</tr>
<tr><td align="left">ofloxacin</td>
<td align="left">9 / 478</td>
<td align="left">55.6 (32.5)</td>
<td align="left">99.6 (0.6)</td>
<td align="left">55.6 (32.5)</td>
<td align="left">100 (0)</td>
</tr>
<tr><td align="left">rifampicin</td>
<td align="left">303 / 1262</td>
<td align="left">94.1 (2.7)</td>
<td align="left">99 (0.5)</td>
<td align="left">93.7 (2.7)</td>
<td align="left">99 (0.5)</td>
</tr>
<tr><td align="left">streptomycin</td>
<td align="left">353 / 1227</td>
<td align="left">77.9 (4.3)</td>
<td align="left">99.1 (0.5)</td>
<td align="left">78.8 (4.3)</td>
<td align="left">99.3 (0.5)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>R / S: number of resistant and susceptible strains. Stability: sensitivity (sensi.) and specificity (speci.) values obtained with the stability-based final models. Mykrobe: sensitivity and specificity values obtained with the Mykrobe predictor. Figures in brackets correspond to half the width of the 95% confidence intervals (CI), to be added and subtracted to obtain the upper and lower bounds of the 95% CI</p>
</table-wrap-foot>
</table-wrap>
</p>
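<p>The half-widths reported in Table 2 appear consistent with a normal-approximation (Wald) interval for a proportion, although the text does not state which interval was used; the sketch below rests on that assumption, and the function name is hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def wald_half_width(proportion, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion.

    proportion: observed sensitivity or specificity, between 0 and 1.
    n: number of strains it was estimated on (resistant strains for the
       sensitivity, susceptible strains for the specificity).
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / n)
```
]]></code>
<p>For kanamycin, a sensitivity of 3/9 estimated on 9 resistant strains gives a half-width of about 30.8 points, which illustrates why sensitivity estimates for drugs with very few resistant strains are so uncertain.</p>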
<p>A total of 19,876,230 distinct <italic>k</italic>
-mers of length 31 were obtained from the 1306 training (assembled) genomes. 5,113,633 <italic>k</italic>
-mers remained after filtering out those occurring in fewer than three strains or in all the strains but one or two, an approximately fourfold reduction. This set of <italic>k</italic>
-mers corresponded to 151,403 sets of equivalent <italic>k</italic>
-mers; a single <italic>k</italic>-mer was randomly picked from each set to define the non-redundant candidate variables used to learn the models. This drastic reduction of the number of <italic>k</italic>
-mers was due to the high clonality of <italic>M. tuberculosis</italic>
 genomes.</p>
</sec>
<sec id="Sec10"><title>Models obtained</title>
<p>As described in “<xref rid="Sec4" ref-type="sec"><italic>k</italic>
-mer selection using <italic>L</italic>
<sub>1</sub>
 penalty and stability selection</xref>
” section and illustrated in Fig. <xref rid="Fig1" ref-type="fig">1</xref>
, the sole parameter to optimize in our approach is the probability threshold of the stability selection procedure, which defines the set of stable <italic>k</italic>
-mers. In this study, it was optimized by cross-validation over the grid {0.6;0.65;0.7;0.75;0.8}, for each antibiotic. Three repetitions of a 10-fold cross-validation process were carried out and the value of the parameter was chosen according to the average Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) obtained: the highest threshold whose AUC was within one point of the highest AUC value was retained, which favours sparser models for a comparable accuracy. The same cross-validation procedure was also applied to evaluate the standard <italic>L</italic>
<sub>1</sub>
-penalized logistic regression approach. In both cases we considered a grid of 200 candidate values of the regularization parameter. It was defined by the glmnet software and ranged on a log scale from a maximum value defined as the smallest value ensuring that at least one variable is selected (i.e., has a non-zero coefficient), to a minimum value defined as this maximum value divided by 10<sup>4</sup>. For the stability selection approach, we resampled the training dataset 100 times.
<fig id="Fig1"><label>Fig. 1</label>
<caption><p>Illustration of the stability selection process for ethambutol. Left: stability paths. Each curve corresponds to a <italic>k</italic>
-mer and represents its selection frequency over all the resampled datasets, across the values of the regularization parameter. Darker red curves correspond to larger selection probabilities (from 0.6 to 0.8), while grey curves correspond to <italic>k</italic>
-mers with a selection probability below 0.6. Middle: the regularization path obtained by fitting an <italic>L</italic>
<sub>1</sub>
 penalized logistic regression model across the entire dataset, with <italic>k</italic>
-mers colored according to the color code defined from the left panel. Right: number of <italic>k</italic>
-mers selected by the stability selection approach for thresholds ranging from 0.6 to 0.8</p>
</caption>
<graphic xlink:href="12859_2018_2403_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
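<p>The three ingredients of this procedure (the grid of regularization values, the set of stable <italic>k</italic>-mers for a given probability threshold, and the rule used to pick that threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: all function names and toy numbers are hypothetical, the stable set is taken to be the variables whose maximum selection frequency along the path reaches the threshold (the definition of Meinshausen and Bühlmann), and the selection frequencies themselves are assumed to have been computed beforehand from the resampled Lasso fits.</p>
<preformat>
```python
import math


def lambda_grid(lambda_max, n_values=200, ratio=1e-4):
    """Log-spaced grid from lambda_max down to lambda_max * ratio."""
    step = math.log(ratio) / (n_values - 1)
    return [lambda_max * math.exp(i * step) for i in range(n_values)]


def stable_set(selection_freq, threshold):
    """Variables whose maximum selection frequency reaches the threshold.

    selection_freq maps a variable name to its selection frequencies
    over the resampled datasets, one value per grid point.
    """
    return sorted(k for k, path in selection_freq.items() if max(path) >= threshold)


def choose_threshold(mean_auc, tolerance=1.0):
    """Highest threshold whose mean AUC (in points) is within
    `tolerance` of the best mean AUC."""
    best = max(mean_auc.values())
    return max(t for t, auc in mean_auc.items() if auc >= best - tolerance)


# Made-up numbers, for illustration only.
grid = lambda_grid(lambda_max=0.5)
toy_freq = {
    "kmer_a": [0.10, 0.50, 0.90],
    "kmer_b": [0.00, 0.30, 0.62],
    "kmer_c": [0.00, 0.10, 0.20],
}
toy_auc = {0.6: 95.0, 0.65: 94.9, 0.7: 94.6, 0.75: 94.2, 0.8: 93.1}

print(len(grid), grid[0], grid[-1])
print(stable_set(toy_freq, 0.6), stable_set(toy_freq, 0.8))
print(choose_threshold(toy_auc))
```
</preformat>
<p>With these toy values, the thresholds 0.6 to 0.75 all lie within one AUC point of the best result, so the rule retains 0.75, the most stringent of them.</p>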
<p>Table <xref rid="Tab1" ref-type="table">1</xref>
 summarizes the predictive performance and number of <italic>k</italic>
-mers selected. As expected from [<xref ref-type="bibr" rid="CR21">21</xref>
], the stability procedure led to sparser signatures than the classical <italic>L</italic>
<sub>1</sub>
 penalization, especially for the ethambutol, kanamycin, and streptomycin antibiotics, at the cost of a slight decrease in AUC. The predictive performance remained comparable, with the largest drop of 3 points observed for ethionamide. Figure <xref rid="Fig1" ref-type="fig">1</xref>
 illustrates the stability selection process for ethambutol. We noted in this example that some <italic>k</italic>
-mers with a relatively high probability of being selected (e.g., ≥0.7, in orange), hence likely to be important to obtain accurate predictions, entered relatively late in the global regularization path of the Lasso, and vice versa, which explained why the Lasso model involved many more <italic>k</italic>
-mers. Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figures S2 and S3 show the same curves for the other antibiotics.</p>
</sec>
<sec id="Sec11"><title>Predictive performance</title>
<p>We then evaluated the models obtained by stability-selection on the sequencing reads of the validation panel. We considered thresholds to call a <italic>k</italic>
-mer present from its number of occurrences in reads ranging from 1 to 50, and the four strategies mentioned in “<xref rid="Sec5" ref-type="sec">Prediction from genome assembly or sequencing reads</xref>
” section to call a <italic>k</italic>
-mer involved in a model present, based on the detection of its equivalent <italic>k</italic>
-mers. While the threshold on the number of occurrences could be optimized for each sample, we systematically set it to 10. No major difference was observed as long as it was neither too low (e.g., 1 or 2, leading to false positive <italic>k</italic>
-mer detection) nor too high (e.g., 25 or 50, missing some <italic>k</italic>
-mers), as illustrated in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S4. The differences observed between the various summarization strategies were also minor, and we systematically relied on the stringent approach.</p>
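<p>Calling a <italic>k</italic>-mer present from raw reads with a threshold on its number of occurrences can be sketched as below. This is a simplified illustration with hypothetical function names and toy reads; counting canonical <italic>k</italic>-mers (the lexicographically smaller of a <italic>k</italic>-mer and its reverse complement) is an assumption of the sketch, and the four strategies used to summarize sets of equivalent <italic>k</italic>-mers are not reproduced here.</p>
<preformat>
```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def is_present(kmer, counts, min_occurrences=10):
    """Call a k-mer present if it is seen at least min_occurrences times."""
    return counts[canonical(kmer)] >= min_occurrences


# Toy reads: one sequence seen 12 times, another seen once
# (for instance a sequencing error).
toy_reads = ["ACGTAC"] * 12 + ["TTTTGG"]
toy_counts = count_kmers(toy_reads, k=4)
print(is_present("CGTA", toy_counts))  # seen 12 times
print(is_present("TTTT", toy_counts))  # seen once only
```
</preformat>
<p>A low threshold would call the singleton <italic>k</italic>-mer present, while a threshold above the sample's depth would miss genuine <italic>k</italic>-mers, which is the trade-off discussed above.</p>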
<p>Table <xref rid="Tab2" ref-type="table">2</xref>
 shows that the performance obtained on the validation dataset by our approach and by the Mykrobe predictor software was comparable. The most important differences were observed for ethambutol and isoniazid, for which Mykrobe showed a higher and a lower sensitivity, respectively, with the opposite effect on specificity. A ROC curve analysis indicated however that the sensitivity/specificity trade-offs achieved by Mykrobe could be met by modifying the decision threshold of the logistic regression model, which was set by default to 0.5. This is illustrated in Fig. <xref rid="Fig2" ref-type="fig">2</xref>
 for ethambutol, and in Additional file <xref rid="MOESM1" ref-type="media">1</xref>: Figure S5 for the other antibiotics. The flexibility offered by the logistic regression model to control the trade-off between sensitivity and specificity can be useful in a diagnostics context to meet the target performance set by regulatory agencies.
<fig id="Fig2"><label>Fig. 2</label>
<caption><p>Ethambutol ROC curve obtained using the <italic>k</italic>
-mers based signature evaluated in the validation dataset. The orange dot represents the performance obtained by the Mykrobe predictor, and the blue one that obtained by our <italic>k</italic>
-mer based approach when using the default threshold of 0.5 to predict a strain as resistant based on the probability provided by the logistic regression model. Arrows represent 95% confidence intervals</p>
</caption>
<graphic xlink:href="12859_2018_2403_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
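<p>The effect of the decision threshold on the sensitivity/specificity trade-off can be illustrated with a short sketch. The function name and the predicted probabilities below are made up for the example and do not come from the study.</p>
<preformat>
```python
def sensitivity_specificity(probabilities, labels, threshold=0.5):
    """Sensitivity and specificity when a strain is called resistant
    if its predicted probability exceeds the threshold.

    labels: 1 for resistant strains, 0 for susceptible strains.
    """
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        called_resistant = p > threshold
        if y == 1 and called_resistant:
            tp += 1
        elif y == 1:
            fn += 1
        elif called_resistant:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy predictions for three resistant and three susceptible strains.
toy_probabilities = [0.9, 0.7, 0.4, 0.35, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]

for t in (0.5, 0.3):
    print(t, sensitivity_specificity(toy_probabilities, toy_labels, t))
```
</preformat>
<p>Lowering the threshold from 0.5 to 0.3 recovers the third resistant strain at the cost of one false positive, which is the kind of adjustment that moves a model along its ROC curve.</p>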
<p>We noted finally that the sequencing depth varied greatly in the validation set, with a minimum value around 20 and a maximum greater than 700. The prediction time increased linearly with the sequencing depth, at about 1.25 s per unit of coverage, so that a sample with 100x coverage could be processed in about two minutes (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S6).</p>
</sec>
<sec id="Sec12"><title>Model interpretation</title>
<p>To interpret the predictive models obtained, sets of equivalent <italic>k</italic>
-mers were assembled into unitigs, as described in “<xref rid="Sec6" ref-type="sec">Model interpretation</xref>
” section, and blasted against the H37Rv reference genome. The unitigs were highly conserved, with a coverage equal to 100% for all the unitigs and a minimum percent identity equal to 96.7% (rifampicin model, unitig #3). Each set of equivalent <italic>k</italic>
-mers corresponded to a single unitig whose length ranged from 31 (the size of the individual <italic>k</italic>
-mers) to 61 nucleotides, with a median value of 44.5 nucleotides (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S7). These unitigs were easier to annotate than the individual <italic>k</italic>
-mers because they were more specific to particular genomic regions, and hence led to less ambiguous blast hits.</p>
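<p>To make the notion of unitig concrete, the sketch below merges <italic>k</italic>-mers that overlap by <italic>k</italic>−1 bases along unambiguous paths. It is a simplified illustration and not the assembly procedure used in the study: the function name and toy sequences are hypothetical, reverse complements are ignored, and no attempt is made at efficiency. Under these assumptions, a chain of <italic>n</italic> consecutive <italic>k</italic>-mers yields a unitig of length <italic>k</italic> + <italic>n</italic> − 1, so a 61-nucleotide unitig would correspond to 31 consecutive 31-mers.</p>
<preformat>
```python
def assemble_unitigs(kmers):
    """Merge k-mers overlapping by k-1 bases along unambiguous paths."""
    kmers = set(kmers)
    successors = {a: sorted(b for b in kmers if a != b and a[1:] == b[:-1])
                  for a in kmers}
    predecessors = {b: sorted(a for a in kmers if a != b and a[1:] == b[:-1])
                    for b in kmers}
    used = set()

    def is_chain_start(kmer):
        # A k-mer is not a start if its unique predecessor leads only to it.
        preds = predecessors[kmer]
        return not (len(preds) == 1 and successors[preds[0]] == [kmer])

    def walk(start):
        unitig, current = start, start
        used.add(start)
        while len(successors[current]) == 1:
            nxt = successors[current][0]
            # Stop at branching points and at k-mers already consumed.
            if nxt in used or predecessors[nxt] != [current]:
                break
            used.add(nxt)
            unitig += nxt[-1]
            current = nxt
        return unitig

    unitigs = [walk(k) for k in sorted(kmers) if is_chain_start(k)]
    # Any k-mer left over belongs to a cycle; start a walk from it.
    for k in sorted(kmers):
        if k not in used:
            unitigs.append(walk(k))
    return sorted(unitigs)


# Toy example with k = 4: five overlapping 4-mers plus an isolated one.
toy_kmers = ["ACGT", "CGTT", "GTTG", "TTGC", "TGCA", "GGGA"]
print(assemble_unitigs(toy_kmers))
```
</preformat>
<p>Here the five overlapping 4-mers collapse into a single unitig of length 4 + 5 − 1 = 8, while the isolated 4-mer forms a unitig on its own.</p>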
<p>Figure <xref rid="Fig3" ref-type="fig">3</xref>
 represents the models and annotations obtained. Interestingly, without any prior information on known resistance determinants, a total of 22 unitigs were retained (1 to 8 per antibiotic), relating to 10 genes or RNAs, most of them already known to be associated with <italic>M. tuberculosis</italic>
 antibiotic resistance, with unitigs originating from the <italic>embB</italic>
 gene for ethambutol, <italic>fabG1</italic>
 for ethionamide, <italic>katG</italic>
 and <italic>fabG1</italic>
 for isoniazid, <italic>rrs</italic>
 and <italic>eis</italic>
 for kanamycin, <italic>gyrA</italic>
 for ofloxacin, <italic>rpoB</italic>
 for rifampicin, and <italic>rrs</italic>
 and <italic>rpsL</italic>
 for streptomycin [<xref ref-type="bibr" rid="CR12">12</xref>
, <xref ref-type="bibr" rid="CR29">29</xref>
]. Note that the <italic>fabG1</italic>
 gene is located just before the <italic>inhA</italic>
 gene, which is one of the two main markers of resistance to isoniazid and ethionamide. Some resistance-associated mutations reported as lying in the promoter region of <italic>inhA</italic>
, as for instance in [<xref ref-type="bibr" rid="CR9">9</xref>
], could actually be considered to fall within the <italic>fabG1</italic> gene.
<fig id="Fig3"><label>Fig. 3</label>
<caption><p>Signature annotation at the unitig level. In total, 22 unitigs falling in 10 genes were retained. Known target/antibiotic associations are shown on the right-hand side of the figure. Numbers correspond to <italic>β</italic>
 coefficients in the unpenalized final logistic model and colors to their magnitude (in absolute value). A negative coefficient leads to a decreased risk of resistance, <italic>i.e.</italic>
, the presence of the corresponding unitig in the strain genome is associated with a decreased risk of antibiotic resistance, and conversely for a positive coefficient. A strain is predicted as resistant if the resulting score, taking into account the intercept of the model, is positive</p>
</caption>
<graphic xlink:href="12859_2018_2403_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>We noted that some <italic>k</italic>
-mers were part of several signatures. This was probably due in large part to the level of correlation between resistance phenotypes observed within the training panel (Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S8). While in some cases such correlation is expected, for instance between rifampicin and isoniazid or between ethionamide and isoniazid [<xref ref-type="bibr" rid="CR12">12</xref>
], it may also be a consequence of a peculiar constitution of the training dataset due to a sampling bias, as discussed in “<xref rid="Sec14" ref-type="sec">Conclusion</xref>
” section.</p>
<p>Interestingly, two <italic>k</italic>
-mers involved in four signatures originated from the 16S rRNA gene (<italic>rrs</italic>
). While this could be expected for kanamycin and streptomycin [<xref ref-type="bibr" rid="CR12">12</xref>
], we noted also that these <italic>k</italic>
-mers captured general population or “clade” effects, as illustrated in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S9. This suggested that resistance may in some cases be intrinsically related to the evolutionary history of the strains, which makes identifying causal resistance determinants difficult [<xref ref-type="bibr" rid="CR17">17</xref>
].</p>
<p>A striking observation was that the predictive models involved far fewer genetic determinants than alternative approaches relying on catalogues of mutations. For instance, [<xref ref-type="bibr" rid="CR9">9</xref>
] compiled a library of 1345 mutations used to predict the resistance to 15 antibiotics. Narrowing the list to the 7 antibiotics considered in this study led to a list of 745 mutations, much greater than the number of <italic>k</italic>
-mers obtained by our approach. There are at least two reasons for that. First, the library of mutations was obtained by compiling results of several studies, involving altogether probably a much greater number of strains than used here. Reproducing our approach on a more exhaustive dataset would most likely lead to selecting more <italic>k</italic>
-mers. Second, we noted that <italic>k</italic>
-mers could represent complex patterns of genetic variation in a concise way, as illustrated in Fig. <xref rid="Fig4" ref-type="fig">4</xref>
 for 3 genes included in the ofloxacin, isoniazid, and rifampicin signatures. These graphs were obtained by mapping the unitigs corresponding to the <italic>k</italic>
-mers of the signatures against the training genomes, extracting and aligning the unique hits obtained and eventually representing the unitigs on the aligned haplotypes. In the upper panel, the unitig falling in the <italic>gyrA</italic>
 gene (obtained from 3 equivalent <italic>k</italic>
-mers) captured a single SNP: a nucleotide other than “G” at the SNP position increased the risk of resistance to ofloxacin. In this simple example, 1 <italic>k</italic>
-mer (or equivalently 1 unitig) indeed corresponded to a single mutation. It appears at position 7581 of the H37Rv chromosome and has already been widely described [<xref ref-type="bibr" rid="CR9">9</xref>
]. In the middle and lower panels, however, unitigs captured more complex resistance determinant patterns than a single SNP. The middle panel showed that a single unitig captured the increased risk of resistance to isoniazid brought by two distinct well known SNPs [<xref ref-type="bibr" rid="CR9">9</xref>
], occurring at positions 2155167 and 2155168 in the <italic>katG</italic>
 gene, and defining 5 distinct haplotypes. The genomic variability observed in the <italic>rpoB</italic>
 gene in the lower panel was even more complex. We noted indeed that the unitigs corresponding to three <italic>k</italic>
-mers of the signature fell in the same region of the gene. Eight SNPs were observed in the training panel in this region between positions 761109 and 761161. They fall in the well-known rifampicin resistance-determining region [<xref ref-type="bibr" rid="CR12">12</xref>
], and all of them have been described in [<xref ref-type="bibr" rid="CR9">9</xref>
], except that observed at position 761156. Using only 3 <italic>k</italic>-mers, the model could therefore account for a much greater number of SNP combinations, these eight SNPs defining 16 haplotypes in the training panel.
<fig id="Fig4"><label>Fig. 4</label>
<caption><p>Examples of genomic variation patterns captured by <italic>k</italic>
-mers in the ofloxacin, isoniazid, and rifampicin signatures. For each signature, multiple alignments of haplotypes found in the training dataset are shown. Unitigs are surrounded by colored boxes and coordinates refer to nucleotide positions on the H37Rv chromosome (NC_000962.3). <bold>a</bold>
 Ofloxacin - DNA gyrase subunit A - gene gyrA: a single SNP in <italic>gyrA</italic>
 predicts ofloxacin resistance. At the SNP position, all 4 bases are observed in the training dataset, the haplotype carrying “G” corresponding to the wild-type, susceptible phenotype. <bold>b</bold>
 Isoniazid - catalase peroxidase - gene katG: 2 SNPs in <italic>katG</italic>
 predict isoniazid resistance, and these 2 SNPs are captured by a single unitig. <bold>c</bold>
 Rifampicin - DNA directed RNA polymerase beta subunit - gene rpoB: 8 SNPs in <italic>rpoB</italic>
 predict rifampicin resistance, and these 8 SNPs are captured by 3 unitigs</p>
</caption>
<graphic xlink:href="12859_2018_2403_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>We note finally that the model coefficients made it possible to measure the relative importance of the various <italic>k</italic>
-mers defining the signatures. These coefficients could indeed be interpreted as log odds ratios, the odds ratio being a well-known statistical indicator measuring the strength of association between a variable and an outcome. We emphasize also that they defined, through the logistic regression model, a probability of resistance, a positive coefficient indicating a higher risk of resistance. This property was interesting because it allowed us to associate a level of confidence with the prediction, and to control the trade-off between sensitivity and specificity that can be achieved by the model (see “<xref rid="Sec11" ref-type="sec">Predictive performance</xref>
” section).</p>
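<p>The link between coefficients, odds ratios and the predicted probability of resistance can be made explicit with a small sketch. The intercept, coefficient values and unitig names below are hypothetical and are not taken from the fitted models of this study; only the standard logistic regression formulas are used.</p>
<preformat>
```python
import math


def odds_ratio(beta):
    """Odds ratio associated with the presence of a k-mer: exp(beta)."""
    return math.exp(beta)


def resistance_probability(intercept, coefficients, present):
    """Probability of resistance given the set of k-mers found in a strain.

    coefficients maps a k-mer (or unitig) name to its beta coefficient;
    present is the set of names detected in the strain.
    """
    score = intercept + sum(b for name, b in coefficients.items() if name in present)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical two-variable model.
toy_intercept = -2.0
toy_coefficients = {"unitig_1": 3.0, "unitig_2": -1.0}

print(odds_ratio(toy_coefficients["unitig_1"]))
print(resistance_probability(toy_intercept, toy_coefficients, {"unitig_1"}))
print(resistance_probability(toy_intercept, toy_coefficients, set()))
```
</preformat>
<p>A positive score, intercept included, is equivalent to a predicted probability above 0.5, which is why the default decision threshold of 0.5 matches the sign rule stated in the caption of Fig. 3.</p>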
</sec>
</sec>
<sec id="Sec13"><title><italic>S. aureus</italic>
 study</title>
<p>We applied the same procedure to <italic>S. aureus</italic>
. The training dataset was taken from [<xref ref-type="bibr" rid="CR6">6</xref>
]. It involved 501 genomes and 6 antibiotics, namely ciprofloxacin, erythromycin, fusidic acid, methicillin, penicillin and tetracyclin. Three other antibiotics were not considered here because the number of resistant strains was too limited (7 and 3 out of 501 for rifampicin and gentamicin respectively, and 2 out of 176 for mupirocin). The validation dataset included 470 genomes and phenotypes, and was also used in [<xref ref-type="bibr" rid="CR7">7</xref>
] to demonstrate the performance of the Mykrobe predictor software. Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Table S1 gives the number of resistant and susceptible strains for these 6 antibiotics.</p>
<p>Table <xref rid="Tab3" ref-type="table">3</xref>
 summarizes the results obtained. We first noted that the stability selection approach also led to very sparse models (1 to 5 <italic>k</italic>
-mers per model), often sparser than the classical Lasso. Predictive performance was in general comparable to or slightly lower than that obtained by Mykrobe, except for fusidic acid where our model significantly lacked sensitivity. This was also the case, to a lesser extent, for ciprofloxacin where a drop of almost 6 points was observed, which corresponded to a difference of 4 strains out of 65 resistant ones and was not significant, as can be seen from the associated confidence intervals. We noted an effect of the strategy used to call a <italic>k</italic>
-mer involved in a signature present from its set of equivalent <italic>k</italic>
-mers. We observed indeed that while the stringent strategy was appropriate in most cases, better results could be obtained for penicillin using the smooth strategy and for tetracyclin using either the vote or the conservative strategy. Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Figure S10 shows the impact of the various strategies, as well as the threshold on the number of <italic>k</italic>
-mer occurrences, on the predictive performance. Table <xref rid="Tab3" ref-type="table">3</xref> gives the best result obtained for each antibiotic, with a threshold set to ten.
<table-wrap id="Tab3"><label>Table 3</label>
<caption><p><italic>S. aureus</italic>
 study: validation results</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left" colspan="3">Stability</th>
<th align="left" colspan="2">Mykrobe</th>
</tr>
<tr><th align="left"></th>
<th align="left"><italic>k</italic>
-mers</th>
<th align="left">sensi.</th>
<th align="left">speci.</th>
<th align="left">sensi.</th>
<th align="left">speci.</th>
</tr>
</thead>
<tbody><tr><td align="left">ciprofloxacin</td>
<td align="left">1 (18)</td>
<td align="left">89.2 (7.5)</td>
<td align="left">99.8 (0.5)</td>
<td align="left">95.4 (5.1)</td>
<td align="left">99.8 (0.5)</td>
</tr>
<tr><td align="left">erythromycin</td>
<td align="left">3 (8)</td>
<td align="left">96.2 (4.2)</td>
<td align="left">99.5 (0.7)</td>
<td align="left">98.7 (2.5)</td>
<td align="left">100 (0)</td>
</tr>
<tr><td align="left">fusidic acid</td>
<td align="left">3 (26)</td>
<td align="left">78 (12.7)</td>
<td align="left">100 (0)</td>
<td align="left">100 (0)</td>
<td align="left">99.1 (0)</td>
</tr>
<tr><td align="left">methicillin</td>
<td align="left">1 (1)</td>
<td align="left">98.1 (3.6)</td>
<td align="left">100 (0)</td>
<td align="left">100 (0)</td>
<td align="left">100 (0)</td>
</tr>
<tr><td align="left">penicillin</td>
<td align="left">1 (1)</td>
<td align="left">99.7 (0.5)</td>
<td align="left">88.3 (6.5)</td>
<td align="left">99.7 (0.5)</td>
<td align="left">88.3 (6.5)</td>
</tr>
<tr><td align="left">tetracyclin</td>
<td align="left">5 (7)</td>
<td align="left">100 (0)</td>
<td align="left">99.8 (0.4)</td>
<td align="left">100 (0)</td>
<td align="left">99.8 (0.4)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>R / S: number of resistant and susceptible strains. Stability: sensitivity (sensi.) and specificity (speci.) values obtained with the stability-based final models. Mykrobe: sensitivity and specificity values obtained with the Mykrobe predictor. Figures in brackets give the half-width of the 95% confidence interval (CI), to be added to and subtracted from the estimate to obtain the upper and lower bounds of the 95% CI</p>
</table-wrap-foot>
</table-wrap>
</p>
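<p>The text does not state how the confidence intervals of Table 3 were computed. As a plausibility check only, the usual normal-approximation (Wald) formula reproduces the two ciprofloxacin half-widths when the sensitivities of 89.2% and 95.4% are read as 58 and 62 correct calls out of the 65 resistant strains mentioned above. These counts are our inference from the reported percentages, and the function name is hypothetical.</p>
<preformat>
```python
import math


def wald_half_width(successes, n, z=1.96):
    """Half-width of the normal-approximation 95% CI of a proportion."""
    p = successes / n
    return z * math.sqrt(p * (1.0 - p) / n)


# Ciprofloxacin sensitivity: stability-based model, then Mykrobe.
for correct_calls in (58, 62):
    sensitivity = 100.0 * correct_calls / 65
    half_width = 100.0 * wald_half_width(correct_calls, 65)
    print(round(sensitivity, 1), round(half_width, 1))
# prints "89.2 7.5" then "95.4 5.1"
```
</preformat>
<p>The two intervals, roughly 81.7 to 96.8 and 90.3 to 100, overlap widely, which is consistent with the statement that the difference of 4 strains is not significant.</p>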
<p>Finally, Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Table S2 presents the annotations of the unitigs obtained. Starting without any prior knowledge from more than 18 million <italic>k</italic>
-mers, reduced to 335,238 filtered and non-redundant ones, known resistance determinants were identified for all drugs. We noted however that sets of equivalent <italic>k</italic>
-mers were sometimes assembled into several unitigs, which therefore corresponded to non contiguous stretches of sequences in total LD within the training dataset. This was in particular the case for penicillin and tetracyclin, which may explain why the stringent <italic>k</italic>
-mer summarization strategy was not appropriate for these antibiotics.</p>
</sec>
</sec>
<sec id="Sec14" sec-type="conclusion"><title>Conclusion</title>
<p>We applied a machine learning approach to predict bacterial resistance phenotypes, starting from their whole genome sequences and without any prior information about the underlying resistance determinants. Using a penalized logistic regression model, coupled with a stability selection approach, we obtained predictive models involving a very limited number of <italic>k</italic>
-mers, yet reaching a performance comparable to that of alternative state-of-the-art bioinformatics strategies for two bacterial species. The <italic>k</italic>
-mers obtained uncovered previously known resistance determinants, thereby confirming that such a data driven strategy is promising to unravel bacterial genotype-phenotype relationships [<xref ref-type="bibr" rid="CR18">18</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
]. The approach is generic and could readily be transposed to new bacterial species, and/or phenotypes, provided that adequate training data is available.</p>
<p>By selecting <italic>k</italic>-mers in a discriminative fashion, our data-driven strategy makes it possible to build complex prediction rules from a limited number of <italic>k</italic>
-mers, as shown here for <italic>M. tuberculosis</italic>
. A potential drawback of this approach, however, is that it intrinsically relies on the level of information provided by the training dataset. In particular, if the level of genomic variability around a causal determinant (e.g., a SNP) is under-represented in the training data because of a sampling bias, our approach will build overly large sets of equivalent <italic>k</italic>
-mers, which will not be detected as such in a new strain showing a different genomic context around the determinant. The various strategies proposed to summarize equivalent <italic>k</italic>
-mers may therefore be useful to cope with this issue. Moreover, the study conducted on <italic>M. tuberculosis</italic>
 revealed that this approach is sensitive to the level of correlation between phenotypes. Some <italic>k</italic>
-mers capturing resistance determinants within the target of a given antibiotic were involved in the model predicting resistance to another antibiotic, simply because strains of the training panels tended to be resistant to both antibiotics. <italic>M. tuberculosis</italic>
 strains may be simultaneously resistant to several antibiotics, which is in part due to the fact that therapies involve antibiotic “cocktails” [<xref ref-type="bibr" rid="CR13">13</xref>
]. In a predictive context where correlation between antibiotics is a biological reality, we consider that such correlation patterns may actually be helpful and leveraged by the model. The overlap we observed between signatures may actually capture some synergistic effects driving simultaneously the resistance to several antibiotics. Coll et al. [<xref ref-type="bibr" rid="CR9">9</xref>
] showed that specific mutations tended to co-occur among multi-drug resistant strains, and evidence has been reported on other bacterial species that a mutation conferring resistance to a given antibiotic could also increase the level of resistance to other antibiotics [<xref ref-type="bibr" rid="CR30">30</xref>
]. This observation also suggests, however, that learning these predictive models jointly within a multi-task learning framework may be a promising way to exploit such correlation patterns. Several extensions of the Lasso have been proposed to encourage tasks to share a common support, depending on their level of correlation [<xref ref-type="bibr" rid="CR31">31</xref>
]. They could provide an interesting way to study and leverage such cross-resistance mechanisms. If however this correlation is specific to the training dataset, hence results from a sampling bias, it can clearly compromise the generalization of the model. Drouin et al. [<xref ref-type="bibr" rid="CR18">18</xref>
] and Davis et al. [<xref ref-type="bibr" rid="CR19">19</xref>
] proposed different strategies to compensate for this correlation while learning predictive models, relying respectively on a subsampling of the dataset, or a post-processing of the list of <italic>k</italic>
-mers selected. Within the penalized regression framework considered here, an alternative strategy could be to rely on multi-task approaches enforcing tasks to have disjoint supports [<xref ref-type="bibr" rid="CR32">32</xref>
]. More generally, correlation between phenotypes makes it harder to establish causal relationships between genetic determinants and antibiotic resistance. It represents a confounding factor for bacterial genome-wide association studies that should be taken into account, as is commonplace for population structure [<xref ref-type="bibr" rid="CR17">17</xref>
].</p>
<p>The versatile framework of penalized logistic regression offers many avenues to further investigate bacterial genotype-phenotype relationships. Besides the extensions to the multi-task setting mentioned above to model cross-resistance mechanisms, it can easily be extended to consider semi-quantitative measurements of antibiotic resistance, using the MIC as the outcome of an ordinal regression model [<xref ref-type="bibr" rid="CR33">33</xref>
]. It is indeed known that mutations can sometimes induce a variable level of resistance [<xref ref-type="bibr" rid="CR12">12</xref>
], and working directly from MICs may lead to better models. Likewise, correlation between strains due to their underlying population structure should be taken into account, by reducing the loss incurred by close strains or relying on mixed-model strategies [<xref ref-type="bibr" rid="CR34">34</xref>
].</p>
<p>In terms of diagnostics, the ability to carry out the prediction from reads coupled with the emergence of nanopore technologies paves the way to real-time sequencing-based applications [<xref ref-type="bibr" rid="CR35">35</xref>
]. A recent proof of concept, conducted on <italic>M. tuberculosis</italic>
 and starting from direct respiratory samples, demonstrated the feasibility of this approach [<xref ref-type="bibr" rid="CR36">36</xref>
]. How such <italic>k</italic>
-mer based strategies could be transposed to metagenomics settings, in order to predict resistance directly from a sample, with a higher level of sequencing noise, remains an open question and will be the purpose of future work.</p>
</sec>
<sec sec-type="supplementary-material"><title>Additional files</title>
<sec id="Sec15"><p><supplementary-material content-type="local-data" id="MOESM1"><media xlink:href="12859_2018_2403_MOESM1_ESM.pdf"><label>Additional file 1</label>
<caption><p>Supplementary information and figures. (PDF 2415 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p><supplementary-material content-type="local-data" id="MOESM2"><media xlink:href="12859_2018_2403_MOESM2_ESM.csv"><label>Additional file 2</label>
<caption><p>Summary of the training dataset involved in the <italic>M. tuberculosis</italic>
 study. (CSV 189 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p><supplementary-material content-type="local-data" id="MOESM3"><media xlink:href="12859_2018_2403_MOESM3_ESM.csv"><label>Additional file 3</label>
<caption><p>Summary of the test dataset involved in the <italic>M. tuberculosis</italic>
 study. (CSV 352 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
<p><supplementary-material content-type="local-data" id="MOESM4"><media xlink:href="12859_2018_2403_MOESM4_ESM.csv"><label>Additional file 4</label>
<caption><p>Summary of the dataset (training and test) involved in the <italic>S. aureus</italic>
 study. (CSV 213 kb)</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back><glossary><title>Abbreviations</title>
<def-list><def-item><term>AUC</term>
<def><p>Area under the (ROC) curve</p>
</def>
</def-item>
<def-item><term>LD</term>
<def><p>Linkage disequilibrium</p>
</def>
</def-item>
<def-item><term>MIC</term>
<def><p>Minimum inhibitory concentration</p>
</def>
</def-item>
<def-item><term>NGS</term>
<def><p>Next-generation sequencing</p>
</def>
</def-item>
<def-item><term>ROC</term>
<def><p>Receiver operating characteristic</p>
</def>
</def-item>
<def-item><term>SNP</term>
<def><p>Single-nucleotide polymorphism</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group><fn id="Fn1"><label>1</label>
<p>The original panel involved 1609 strains but 23 of them have missing phenotypes for all the drugs considered in this study.
<table-wrap id="Tab1"><label>Table 1</label>
<caption><p><italic>M. tuberculosis</italic>
 study: cross-validation results</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left">R / S</th>
<th align="left">k-mers</th>
<th align="left">AUC</th>
<th align="left"><italic>p</italic>
</th>
</tr>
</thead>
<tbody><tr><td align="left">ethambutol</td>
<td align="left">333 / 712</td>
<td align="left">8 (79)</td>
<td align="left">91.0 (92.0)</td>
<td align="left">0.75</td>
</tr>
<tr><td align="left">ethionamide</td>
<td align="left">172 / 250</td>
<td align="left">3 (9)</td>
<td align="left">82.2 (85.3)</td>
<td align="left">0.7</td>
</tr>
<tr><td align="left">isoniazid</td>
<td align="left">815 / 472</td>
<td align="left">4 (5)</td>
<td align="left">96.2 (96.0)</td>
<td align="left">0.8</td>
</tr>
<tr><td align="left">kanamycin</td>
<td align="left">187 / 484</td>
<td align="left">2 (49)</td>
<td align="left">90.1 (93.2)</td>
<td align="left">0.8</td>
</tr>
<tr><td align="left">ofloxacin</td>
<td align="left">238 / 458</td>
<td align="left">1 (1)</td>
<td align="left">91.2 (91.1)</td>
<td align="left">0.8</td>
</tr>
<tr><td align="left">rifampicin</td>
<td align="left">668 / 533</td>
<td align="left">7 (6)</td>
<td align="left">96.6 (96.6)</td>
<td align="left">0.8</td>
</tr>
<tr><td align="left">streptomycin</td>
<td align="left">492 / 678</td>
<td align="left">7 (22)</td>
<td align="left">90.8 (92.3)</td>
<td align="left">0.8</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>R / S: number of resistant and susceptible strains. <italic>k</italic>
-mers: number of <italic>k</italic>
-mers in final models obtained with the stability selection approach and with the classical lasso penalty (between brackets). AUC: cross-validated AUC obtained with the stability selection approach and with the classical lasso penalty (within brackets). <italic>p</italic>
: probability threshold selected for the stability approach</p>
</table-wrap-foot>
</table-wrap>
</p>
</fn>
</fn-group>
<ack><sec id="d29e2035"><title>Availability of data and materials</title>
<p>All the datasets involved in this study are available in public repositories, as described in Additional file <xref rid="MOESM1" ref-type="media">1</xref>
: Section S4. CSV files providing links to sequence data and reference phenotypes are available as supplementary materials (Additional file <xref rid="MOESM2" ref-type="media">2</xref>
, Additional file <xref rid="MOESM3" ref-type="media">3</xref>
, Additional file <xref rid="MOESM4" ref-type="media">4</xref>
). An R function implementing the stability selection procedure is available from the authors upon request.</p>
</sec>
</ack>
<notes notes-type="author-contribution"><title>Authors’ contributions</title>
<p>PM and MT conceived the study. PM conducted the experiments. PM and MT analyzed the results and wrote the manuscript. Both authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="COI-statement"><sec><title>Ethics approval and consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec><title>Consent for publication</title>
<p>Not applicable.</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Publisher’s Note</title>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</sec>
</notes>
<ref-list id="Bib1"><title>References</title>
<ref id="CR1"><label>1</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Loman</surname>
<given-names>NJ</given-names>
</name>
<name><surname>Constantinidou</surname>
<given-names>C</given-names>
</name>
<name><surname>Chan</surname>
<given-names>JZ</given-names>
</name>
<name><surname>Halachev</surname>
<given-names>M</given-names>
</name>
<name><surname>Sergeant</surname>
<given-names>M</given-names>
</name>
<name><surname>Penn</surname>
<given-names>CW</given-names>
</name>
<name><surname>Robinson</surname>
<given-names>ER</given-names>
</name>
<name><surname>Pallen</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity</article-title>
<source>Nat Rev Microbiol</source>
<year>2012</year>
<volume>10</volume>
<issue>9</issue>
<fpage>599</fpage>
<lpage>606</lpage>
<pub-id pub-id-type="doi">10.1038/nrmicro2850</pub-id>
<pub-id pub-id-type="pmid">22864262</pub-id>
</element-citation>
</ref>
<ref id="CR2"><label>2</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chan</surname>
<given-names>Jacqueline Z-M</given-names>
</name>
<name><surname>Pallen</surname>
<given-names>Mark J</given-names>
</name>
<name><surname>Oppenheim</surname>
<given-names>Beryl</given-names>
</name>
<name><surname>Constantinidou</surname>
<given-names>Chrystala</given-names>
</name>
</person-group>
<article-title>Genome sequencing in clinical microbiology</article-title>
<source>Nature Biotechnology</source>
<year>2012</year>
<volume>30</volume>
<issue>11</issue>
<fpage>1068</fpage>
<lpage>1071</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2410</pub-id>
</element-citation>
</ref>
<ref id="CR3"><label>3</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bertelli</surname>
<given-names>C.</given-names>
</name>
<name><surname>Greub</surname>
<given-names>G.</given-names>
</name>
</person-group>
<article-title>Rapid bacterial genome sequencing: methods and applications in clinical microbiology</article-title>
<source>Clinical Microbiology and Infection</source>
<year>2013</year>
<volume>19</volume>
<issue>9</issue>
<fpage>803</fpage>
<lpage>813</lpage>
<pub-id pub-id-type="doi">10.1111/1469-0691.12217</pub-id>
<pub-id pub-id-type="pmid">23601179</pub-id>
</element-citation>
</ref>
<ref id="CR4"><label>4</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Didelot</surname>
<given-names>Xavier</given-names>
</name>
<name><surname>Bowden</surname>
<given-names>Rory</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>Daniel J.</given-names>
</name>
<name><surname>Peto</surname>
<given-names>Tim E. A.</given-names>
</name>
<name><surname>Crook</surname>
<given-names>Derrick W.</given-names>
</name>
</person-group>
<article-title>Transforming clinical microbiology with bacterial genome sequencing</article-title>
<source>Nature Reviews Genetics</source>
<year>2012</year>
<volume>13</volume>
<issue>9</issue>
<fpage>601</fpage>
<lpage>612</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3226</pub-id>
</element-citation>
</ref>
<ref id="CR5"><label>5</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bergmiller</surname>
<given-names>T</given-names>
</name>
<name><surname>Andersson</surname>
<given-names>AM</given-names>
</name>
<name><surname>Tomasek</surname>
<given-names>K</given-names>
</name>
<name><surname>Balleza</surname>
<given-names>E</given-names>
</name>
<name><surname>Kiviet</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Hauschild</surname>
<given-names>R</given-names>
</name>
<name><surname>Tkačik</surname>
<given-names>G</given-names>
</name>
<name><surname>Guet</surname>
<given-names>CC</given-names>
</name>
</person-group>
<article-title>Biased partitioning of the multidrug efflux pump <italic>AcrAB-TolC</italic>
 underlies long-lived phenotypic heterogeneity</article-title>
<source>Science</source>
<year>2017</year>
<volume>356</volume>
<issue>6335</issue>
<fpage>311</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1126/science.aaf4762</pub-id>
<pub-id pub-id-type="pmid">28428424</pub-id>
</element-citation>
</ref>
<ref id="CR6"><label>6</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gordon</surname>
<given-names>NC</given-names>
</name>
<name><surname>Price</surname>
<given-names>JR</given-names>
</name>
<name><surname>Cole</surname>
<given-names>K</given-names>
</name>
<name><surname>Everitt</surname>
<given-names>R</given-names>
</name>
<name><surname>Morgan</surname>
<given-names>M</given-names>
</name>
<name><surname>Finney</surname>
<given-names>F</given-names>
</name>
<name><surname>Kearns</surname>
<given-names>AM</given-names>
</name>
<name><surname>Pichon</surname>
<given-names>B</given-names>
</name>
<name><surname>Young</surname>
<given-names>B</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Llewelyn</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Paul</surname>
<given-names>J</given-names>
</name>
<name><surname>Peto</surname>
<given-names>TEA</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<name><surname>Walker</surname>
<given-names>AS</given-names>
</name>
<name><surname>Golubchik</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Prediction of <italic>Staphylococcus aureus</italic>
 Antimicrobial Resistance by Whole-Genome Sequencing</article-title>
<source>J Clin Microbiol</source>
<year>2014</year>
<volume>52</volume>
<issue>4</issue>
<fpage>1182</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="doi">10.1128/JCM.03117-13</pub-id>
<pub-id pub-id-type="pmid">24501024</pub-id>
</element-citation>
</ref>
<ref id="CR7"><label>7</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bradley</surname>
<given-names>P</given-names>
</name>
<name><surname>Gordon</surname>
<given-names>NC</given-names>
</name>
<name><surname>Walker</surname>
<given-names>TM</given-names>
</name>
<name><surname>Dunn</surname>
<given-names>L</given-names>
</name>
<name><surname>Heys</surname>
<given-names>S</given-names>
</name>
<name><surname>Huang</surname>
<given-names>B</given-names>
</name>
<name><surname>Earle</surname>
<given-names>S</given-names>
</name>
<name><surname>Pankhurst</surname>
<given-names>L</given-names>
</name>
<name><surname>Anson</surname>
<given-names>L</given-names>
</name>
<name><surname>de Cesare</surname>
<given-names>M</given-names>
</name>
<name><surname>Piazza</surname>
<given-names>P</given-names>
</name>
<name><surname>Votintseva</surname>
<given-names>AA</given-names>
</name>
<name><surname>Golubchik</surname>
<given-names>T</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Wyllie</surname>
<given-names>DH</given-names>
</name>
<name><surname>Diel</surname>
<given-names>R</given-names>
</name>
<name><surname>Niemann</surname>
<given-names>S</given-names>
</name>
<name><surname>Feuerriegel</surname>
<given-names>S</given-names>
</name>
<name><surname>Kohl</surname>
<given-names>TA</given-names>
</name>
<name><surname>Ismail</surname>
<given-names>N</given-names>
</name>
<name><surname>Omar</surname>
<given-names>SV</given-names>
</name>
<name><surname>Smith</surname>
<given-names>EG</given-names>
</name>
<name><surname>Buck</surname>
<given-names>D</given-names>
</name>
<name><surname>McVean</surname>
<given-names>G</given-names>
</name>
<name><surname>Walker</surname>
<given-names>AS</given-names>
</name>
<name><surname>Peto</surname>
<given-names>T</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<name><surname>Iqbal</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Rapid antibiotic-resistance predictions from genome sequence data for <italic>Staphylococcus aureus</italic>
 and <italic>Mycobacterium tuberculosis</italic>
</article-title>
<source>Nat Commun</source>
<year>2015</year>
<volume>6</volume>
<fpage>10063</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms10063</pub-id>
<pub-id pub-id-type="pmid">26686880</pub-id>
</element-citation>
</ref>
<ref id="CR8"><label>8</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Walker</surname>
<given-names>TM</given-names>
</name>
<name><surname>Kohl</surname>
<given-names>TA</given-names>
</name>
<name><surname>Omar</surname>
<given-names>SV</given-names>
</name>
<name><surname>Hedge</surname>
<given-names>J</given-names>
</name>
<name><surname>Elias</surname>
<given-names>CDO</given-names>
</name>
<name><surname>Bradley</surname>
<given-names>P</given-names>
</name>
<name><surname>Iqbal</surname>
<given-names>Z</given-names>
</name>
<name><surname>Feuerriegel</surname>
<given-names>S</given-names>
</name>
<name><surname>Niehaus</surname>
<given-names>KE</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Clifton</surname>
<given-names>DA</given-names>
</name>
<name><surname>Kapatai</surname>
<given-names>G</given-names>
</name>
<name><surname>Ip</surname>
<given-names>CLC</given-names>
</name>
<name><surname>Bowden</surname>
<given-names>R</given-names>
</name>
<name><surname>Drobniewski</surname>
<given-names>FA</given-names>
</name>
<name><surname>Allix-Béguec</surname>
<given-names>C</given-names>
</name>
<name><surname>Gaudin</surname>
<given-names>C</given-names>
</name>
<name><surname>Parkhill</surname>
<given-names>J</given-names>
</name>
<name><surname>Diel</surname>
<given-names>R</given-names>
</name>
<name><surname>Supply</surname>
<given-names>P</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<name><surname>Smith</surname>
<given-names>EG</given-names>
</name>
<name><surname>Walker</surname>
<given-names>AS</given-names>
</name>
<name><surname>Ismail</surname>
<given-names>N</given-names>
</name>
<name><surname>Niemann</surname>
<given-names>S</given-names>
</name>
<name><surname>Peto</surname>
<given-names>TEA</given-names>
</name>
</person-group>
<article-title>Whole-genome sequencing for prediction of <italic>Mycobacterium tuberculosis</italic>
 drug susceptibility and resistance: a retrospective cohort study</article-title>
<source>Lancet Infect Dis</source>
<year>2015</year>
<volume>15</volume>
<fpage>1193</fpage>
<lpage>202</lpage>
<pub-id pub-id-type="doi">10.1016/S1473-3099(15)00062-6</pub-id>
<pub-id pub-id-type="pmid">26116186</pub-id>
</element-citation>
</ref>
<ref id="CR9"><label>9</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Coll</surname>
<given-names>F</given-names>
</name>
<name><surname>McNerney</surname>
<given-names>R</given-names>
</name>
<name><surname>Preston</surname>
<given-names>MD</given-names>
</name>
<name><surname>Guerra-Assunção</surname>
<given-names>JA</given-names>
</name>
<name><surname>Warry</surname>
<given-names>A</given-names>
</name>
<name><surname>Hill-Cawthorne</surname>
<given-names>G</given-names>
</name>
<name><surname>Mallard</surname>
<given-names>K</given-names>
</name>
<name><surname>Nair</surname>
<given-names>M</given-names>
</name>
<name><surname>Miranda</surname>
<given-names>A</given-names>
</name>
<name><surname>Alves</surname>
<given-names>A</given-names>
</name>
<name><surname>Perdigão</surname>
<given-names>J</given-names>
</name>
<name><surname>Viveiros</surname>
<given-names>M</given-names>
</name>
<name><surname>Portugal</surname>
<given-names>I</given-names>
</name>
<name><surname>Hasan</surname>
<given-names>Z</given-names>
</name>
<name><surname>Hasan</surname>
<given-names>R</given-names>
</name>
<name><surname>Glynn</surname>
<given-names>JR</given-names>
</name>
<name><surname>Martin</surname>
<given-names>N</given-names>
</name>
<name><surname>Pain</surname>
<given-names>A</given-names>
</name>
<name><surname>Clark</surname>
<given-names>TG</given-names>
</name>
</person-group>
<article-title>Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences</article-title>
<source>Genome Med</source>
<year>2015</year>
<volume>7</volume>
<issue>1</issue>
<fpage>51</fpage>
<pub-id pub-id-type="doi">10.1186/s13073-015-0164-0</pub-id>
<pub-id pub-id-type="pmid">26019726</pub-id>
</element-citation>
</ref>
<ref id="CR10"><label>10</label>
<mixed-citation publication-type="other">Schleusener V, Köser CU, Beckert P, Niemann S, Feuerriegel S. <italic>Mycobacterium tuberculosis</italic>
 resistance prediction and lineage classification from genome sequencing: comparison of automated analysis tools. Bioinformatics. 2018;4(10):1666–71. See <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/pubmed/29240876">https://www.ncbi.nlm.nih.gov/pubmed/29240876</ext-link>
.</mixed-citation>
</ref>
<ref id="CR11"><label>11</label>
<mixed-citation publication-type="other">Yang Y, Niehaus KE, Walker TM, Iqbal Z, Walker AS, Wilson DJ, Peto TEA, Crook D, Smith EG, Zhu T, Clifton DA. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics. 2017;801.</mixed-citation>
</ref>
<ref id="CR12"><label>12</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Palomino</surname>
<given-names>JC</given-names>
</name>
<name><surname>Martin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Drug resistance mechanisms in <italic>Mycobacterium tuberculosis</italic>
</article-title>
<source>Antibiotics</source>
<year>2014</year>
<volume>3</volume>
<fpage>317</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="doi">10.3390/antibiotics3030317</pub-id>
<pub-id pub-id-type="pmid">27025748</pub-id>
</element-citation>
</ref>
<ref id="CR13"><label>13</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Yew</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>Mechanisms of drug resistance in <italic>Mycobacterium tuberculosis</italic>
</article-title>
<source>Int J Tuberc Lung Dis</source>
<year>2009</year>
<volume>13</volume>
<fpage>1320</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="pmid">19861002</pub-id>
</element-citation>
</ref>
<ref id="CR14"><label>14</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name><surname>Li</surname>
<given-names>D</given-names>
</name>
<name><surname>Zhao</surname>
<given-names>L</given-names>
</name>
<name><surname>Fleming</surname>
<given-names>J</given-names>
</name>
<name><surname>Lin</surname>
<given-names>N</given-names>
</name>
<name><surname>Wang</surname>
<given-names>T</given-names>
</name>
<name><surname>Liu</surname>
<given-names>Z</given-names>
</name>
<name><surname>Li</surname>
<given-names>C</given-names>
</name>
<name><surname>Galwey</surname>
<given-names>N</given-names>
</name>
<name><surname>Deng</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhou</surname>
<given-names>Y</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>Y</given-names>
</name>
<name><surname>Gao</surname>
<given-names>Y</given-names>
</name>
<name><surname>Wang</surname>
<given-names>T</given-names>
</name>
<name><surname>Wang</surname>
<given-names>S</given-names>
</name>
<name><surname>Huang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Wang</surname>
<given-names>M</given-names>
</name>
<name><surname>Zhong</surname>
<given-names>Q</given-names>
</name>
<name><surname>Zhou</surname>
<given-names>L</given-names>
</name>
<name><surname>Chen</surname>
<given-names>T</given-names>
</name>
<name><surname>Zhou</surname>
<given-names>J</given-names>
</name>
<name><surname>Yang</surname>
<given-names>R</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>G</given-names>
</name>
<name><surname>Hang</surname>
<given-names>H</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name><surname>Li</surname>
<given-names>F</given-names>
</name>
<name><surname>Wan</surname>
<given-names>K</given-names>
</name>
<name><surname>Wang</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>X-E</given-names>
</name>
<name><surname>Bi</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Genome sequencing of 161 <italic>Mycobacterium tuberculosis</italic>
 isolates from China identifies genes and intergenic regions associated with drug resistance</article-title>
<source>Nat Genet</source>
<year>2013</year>
<volume>45</volume>
<fpage>1255</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/ng.2735</pub-id>
<pub-id pub-id-type="pmid">23995137</pub-id>
</element-citation>
</ref>
<ref id="CR15"><label>15</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Palmer</surname>
<given-names>AC</given-names>
</name>
<name><surname>Kishony</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Understanding, predicting and manipulating the genotypic evolution of antibiotic resistance</article-title>
<source>Nat Rev Genet</source>
<year>2013</year>
<volume>14</volume>
<fpage>243</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3351</pub-id>
<pub-id pub-id-type="pmid">23419278</pub-id>
</element-citation>
</ref>
<ref id="CR16"><label>16</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lees</surname>
<given-names>JA</given-names>
</name>
<name><surname>Vehkala</surname>
<given-names>M</given-names>
</name>
<name><surname>Välimäki</surname>
<given-names>N</given-names>
</name>
<name><surname>Harris</surname>
<given-names>SR</given-names>
</name>
<name><surname>Chewapreecha</surname>
<given-names>C</given-names>
</name>
<name><surname>Croucher</surname>
<given-names>NJ</given-names>
</name>
<name><surname>Marttinen</surname>
<given-names>P</given-names>
</name>
<name><surname>Honkela</surname>
<given-names>A</given-names>
</name>
<name><surname>Parkhill</surname>
<given-names>J</given-names>
</name>
<name><surname>Bentley</surname>
<given-names>SD</given-names>
</name>
<name><surname>Corander</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes</article-title>
<source>Nat Commun</source>
<year>2016</year>
<volume>7</volume>
<fpage>12797</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms12797</pub-id>
<pub-id pub-id-type="pmid">27633831</pub-id>
</element-citation>
</ref>
<ref id="CR17"><label>17</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Earle</surname>
<given-names>SG</given-names>
</name>
<name><surname>Wu</surname>
<given-names>C-H</given-names>
</name>
<name><surname>Charlesworth</surname>
<given-names>J</given-names>
</name>
<name><surname>Stoesser</surname>
<given-names>N</given-names>
</name>
<name><surname>Gordon</surname>
<given-names>NC</given-names>
</name>
<name><surname>Walker</surname>
<given-names>TM</given-names>
</name>
<name><surname>Spencer</surname>
<given-names>CCA</given-names>
</name>
<name><surname>Iqbal</surname>
<given-names>Z</given-names>
</name>
<name><surname>Clifton</surname>
<given-names>DA</given-names>
</name>
<name><surname>Hopkins</surname>
<given-names>KL</given-names>
</name>
<name><surname>Woodford</surname>
<given-names>N</given-names>
</name>
<name><surname>Smith</surname>
<given-names>EG</given-names>
</name>
<name><surname>Ismail</surname>
<given-names>N</given-names>
</name>
<name><surname>Llewelyn</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Peto</surname>
<given-names>TE</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<name><surname>McVean</surname>
<given-names>G</given-names>
</name>
<name><surname>Walker</surname>
<given-names>AS</given-names>
</name>
<name><surname>Wilson</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Identifying lineage effects when controlling for population structure improves power in bacterial association studies</article-title>
<source>Nat Microbiol</source>
<year>2016</year>
<volume>1</volume>
<fpage>16041</fpage>
<pub-id pub-id-type="doi">10.1038/nmicrobiol.2016.41</pub-id>
<pub-id pub-id-type="pmid">27572646</pub-id>
</element-citation>
</ref>
<ref id="CR18"><label>18</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Drouin</surname>
<given-names>A</given-names>
</name>
<name><surname>Giguère</surname>
<given-names>S</given-names>
</name>
<name><surname>Déraspe</surname>
<given-names>M</given-names>
</name>
<name><surname>Marchand</surname>
<given-names>M</given-names>
</name>
<name><surname>Tyers</surname>
<given-names>M</given-names>
</name>
<name><surname>Loo</surname>
<given-names>VG</given-names>
</name>
<name><surname>Bourgault</surname>
<given-names>A-M</given-names>
</name>
<name><surname>Laviolette</surname>
<given-names>F</given-names>
</name>
<name><surname>Corbeil</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons</article-title>
<source>BMC Genomics</source>
<year>2016</year>
<volume>17</volume>
<issue>1</issue>
<fpage>754</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-016-2889-6</pub-id>
<pub-id pub-id-type="pmid">27671088</pub-id>
</element-citation>
</ref>
<ref id="CR19"><label>19</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname>
<given-names>JJ</given-names>
</name>
<name><surname>Boisvert</surname>
<given-names>S</given-names>
</name>
<name><surname>Brettin</surname>
<given-names>T</given-names>
</name>
<name><surname>Kenyon</surname>
<given-names>RW</given-names>
</name>
<name><surname>Mao</surname>
<given-names>C</given-names>
</name>
<name><surname>Olson</surname>
<given-names>R</given-names>
</name>
<name><surname>Overbeek</surname>
<given-names>R</given-names>
</name>
<name><surname>Santerre</surname>
<given-names>J</given-names>
</name>
<name><surname>Shukla</surname>
<given-names>M</given-names>
</name>
<name><surname>Wattam</surname>
<given-names>AR</given-names>
</name>
<name><surname>Will</surname>
<given-names>R</given-names>
</name>
<name><surname>Xia</surname>
<given-names>F</given-names>
</name>
<name><surname>Stevens</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Antimicrobial resistance prediction in PATRIC and RAST</article-title>
<source>Sci Rep</source>
<year>2016</year>
<volume>6</volume>
<fpage>27930</fpage>
<pub-id pub-id-type="doi">10.1038/srep27930</pub-id>
<pub-id pub-id-type="pmid">27297683</pub-id>
</element-citation>
</ref>
<ref id="CR20"><label>20</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Eyre</surname>
<given-names>DW</given-names>
</name>
<name><surname>De Silva</surname>
<given-names>D</given-names>
</name>
<name><surname>Cole</surname>
<given-names>K</given-names>
</name>
<name><surname>Peters</surname>
<given-names>J</given-names>
</name>
<name><surname>Cole</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Grad</surname>
<given-names>YH</given-names>
</name>
<name><surname>Demczuk</surname>
<given-names>W</given-names>
</name>
<name><surname>Martin</surname>
<given-names>I</given-names>
</name>
<name><surname>Mulvey</surname>
<given-names>MR</given-names>
</name>
<name><surname>Crook</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>WGS to predict antibiotic MICs for <italic>Neisseria gonorrhoeae</italic>
</article-title>
<source>J Antimicrob Chemother</source>
<year>2017</year>
<volume>72</volume>
<issue>7</issue>
<fpage>1937</fpage>
<lpage>47</lpage>
<pub-id pub-id-type="doi">10.1093/jac/dkx067</pub-id>
<pub-id pub-id-type="pmid">28333355</pub-id>
</element-citation>
</ref>
<ref id="CR21"><label>21</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Meinshausen</surname>
<given-names>N</given-names>
</name>
<name><surname>Bühlmann</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Stability selection</article-title>
<source>J R Stat Soc Ser B</source>
<year>2010</year>
<volume>72</volume>
<fpage>417</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1111/j.1467-9868.2010.00740.x</pub-id>
</element-citation>
</ref>
<ref id="CR22"><label>22</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Boisvert</surname>
<given-names>Sébastien</given-names>
</name>
<name><surname>Raymond</surname>
<given-names>Frédéric</given-names>
</name>
<name><surname>Godzaridis</surname>
<given-names>Élénie</given-names>
</name>
<name><surname>Laviolette</surname>
<given-names>François</given-names>
</name>
<name><surname>Corbeil</surname>
<given-names>Jacques</given-names>
</name>
</person-group>
<article-title>Ray Meta: scalable de novo metagenome assembly and profiling</article-title>
<source>Genome Biology</source>
<year>2012</year>
<volume>13</volume>
<issue>12</issue>
<fpage>R122</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2012-13-12-r122</pub-id>
<pub-id pub-id-type="pmid">23259615</pub-id>
</element-citation>
</ref>
<ref id="CR23"><label>23</label>
<mixed-citation publication-type="other">Bach FR. Bolasso: model consistent lasso estimation through the bootstrap. In: Cohen WW, McCallum A, Roweis ST, editors. International Conference on Machine Learning: 2008. p. 33–40. <ext-link ext-link-type="uri" xlink:href="http://doi.acm.org/10.1145/1390156.1390161">http://doi.acm.org/10.1145/1390156.1390161</ext-link>
.</mixed-citation>
</ref>
<ref id="CR24"><label>24</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lim</surname>
<given-names>C</given-names>
</name>
<name><surname>Yu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Estimation stability with cross-validation (ESCV)</article-title>
<source>J Comput Graph Stat</source>
<year>2016</year>
<volume>25</volume>
<issue>2</issue>
<fpage>464</fpage>
<lpage>92</lpage>
<pub-id pub-id-type="doi">10.1080/10618600.2015.1020159</pub-id>
</element-citation>
</ref>
<ref id="CR25"><label>25</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Friedman</surname>
<given-names>J</given-names>
</name>
<name><surname>Hastie</surname>
<given-names>T</given-names>
</name>
<name><surname>Tibshirani</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Regularization paths for generalized linear models via coordinate descent</article-title>
<source>J Stat Softw</source>
<year>2010</year>
<volume>33</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="doi">10.18637/jss.v033.i01</pub-id>
<pub-id pub-id-type="pmid">20808728</pub-id>
</element-citation>
</ref>
<ref id="CR26"><label>26</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kurtz</surname>
<given-names>S</given-names>
</name>
<name><surname>Phillippy</surname>
<given-names>A</given-names>
</name>
<name><surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name><surname>Smoot</surname>
<given-names>M</given-names>
</name>
<name><surname>Shumway</surname>
<given-names>M</given-names>
</name>
<name><surname>Antonescu</surname>
<given-names>C</given-names>
</name>
<name><surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Versatile and open software for comparing large genomes</article-title>
<source>Genome Biol</source>
<year>2004</year>
<volume>5</volume>
<issue>2</issue>
<fpage>R12</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2004-5-2-r12</pub-id>
</element-citation>
</ref>
<ref id="CR27"><label>27</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name><surname>Limasset</surname>
<given-names>A</given-names>
</name>
<name><surname>Medvedev</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Compacting De Bruijn graphs from sequencing data quickly and in low memory</article-title>
<source>Bioinformatics</source>
<year>2016</year>
<volume>32</volume>
<issue>12</issue>
<fpage>i201</fpage>
<lpage>i208</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw279</pub-id>
</element-citation>
</ref>
<ref id="CR28"><label>28</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>J</given-names>
</name>
<name><surname>Gao</surname>
<given-names>X</given-names>
</name>
<name><surname>Luo</surname>
<given-names>T</given-names>
</name>
<name><surname>Wu</surname>
<given-names>J</given-names>
</name>
<name><surname>Sun</surname>
<given-names>G</given-names>
</name>
<name><surname>Liu</surname>
<given-names>Q</given-names>
</name>
<name><surname>Jiang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Mei</surname>
<given-names>J</given-names>
</name>
<name><surname>Gao</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>Association of <italic>gyrA/B</italic>
 mutations and resistance levels to fluoroquinolones in clinical isolates of <italic>Mycobacterium tuberculosis</italic>
</article-title>
<source>Emerg Microbes Infect</source>
<year>2014</year>
<volume>3</volume>
<issue>3</issue>
<fpage>19</fpage>
<pub-id pub-id-type="doi">10.1038/emi.2014.21</pub-id>
</element-citation>
</ref>
<ref id="CR29"><label>29</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Jnawali</surname>
<given-names>HN</given-names>
</name>
<name><surname>Ryoo</surname>
<given-names>S</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Mahboub</surname>
<given-names>BH</given-names>
</name>
<name><surname>Vats</surname>
<given-names>MG</given-names>
</name>
</person-group>
<article-title>First- and second-line drugs and drug resistance</article-title>
<source>Tuberculosis - Current Issues in Diagnosis and Management</source>
<year>2013</year>
<publisher-loc>London</publisher-loc>
<publisher-name>IntechOpen</publisher-name>
</element-citation>
</ref>
<ref id="CR30"><label>30</label>
<mixed-citation publication-type="other">Lázár V, Nagy I, Spohn R, Csörgö B, Györkei A, Nyerges A, Horváth B, Vörös A, Busa-Fekete R, Hrtyan M, Bogos B, Méhi O, Fekete G, Szappanos B, Kégl B, Papp B, Pál C. Genome-wide analysis captures the determinants of the antibiotic cross-resistance interaction network. Nat Commun. 2014;5. 10.1038/ncomms5352.</mixed-citation>
</ref>
<ref id="CR31"><label>31</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Kim</surname>
<given-names>S</given-names>
</name>
<name><surname>Xing</surname>
<given-names>EP</given-names>
</name>
</person-group>
<article-title>Tree-guided group lasso for multi-task regression with structured sparsity</article-title>
<source>International Conference on Machine Learning</source>
<year>2010</year>
<publisher-loc>USA</publisher-loc>
<publisher-name>Omnipress</publisher-name>
</element-citation>
</ref>
<ref id="CR32"><label>32</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Vervier</surname>
<given-names>K</given-names>
</name>
<name><surname>Mahé</surname>
<given-names>P</given-names>
</name>
<name><surname>D’Aspremont</surname>
<given-names>A</given-names>
</name>
<name><surname>Veyrieras</surname>
<given-names>J-B</given-names>
</name>
<name><surname>Vert</surname>
<given-names>J-P</given-names>
</name>
</person-group>
<article-title>On learning matrices with orthogonal columns or disjoint supports</article-title>
<source>Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
<year>2014</year>
<publisher-loc>Berlin</publisher-loc>
<publisher-name>Springer Berlin Heidelberg</publisher-name>
</element-citation>
</ref>
<ref id="CR33"><label>33</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>McCullagh</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Regression models for ordinal data</article-title>
<source>J R Stat Soc Ser B</source>
<year>1980</year>
<volume>42</volume>
<fpage>109</fpage>
<lpage>42</lpage>
</element-citation>
</ref>
<ref id="CR34"><label>34</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Dundar</surname>
<given-names>M</given-names>
</name>
<name><surname>Krishnapuram</surname>
<given-names>B</given-names>
</name>
<name><surname>Bi</surname>
<given-names>J</given-names>
</name>
<name><surname>Rao</surname>
<given-names>RB</given-names>
</name>
</person-group>
<article-title>Learning classifiers when the training data is not IID</article-title>
<source>International Joint Conference on Artificial Intelligence</source>
<year>2007</year>
<publisher-loc>San Francisco</publisher-loc>
<publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>
</element-citation>
</ref>
<ref id="CR35"><label>35</label>
<mixed-citation publication-type="other">van der Helm E, Imamovic L, Hashim Ellabaan MM, van Schaik W, Koza A, Sommer MOA. Rapid resistome mapping using Nanopore sequencing. Nucleic Acids Res. 2017; 45(8):61. 10.1093/nar/gkw1328.</mixed-citation>
</ref>
<ref id="CR36"><label>36</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Votintseva</surname>
<given-names>Antonina A.</given-names>
</name>
<name><surname>Bradley</surname>
<given-names>Phelim</given-names>
</name>
<name><surname>Pankhurst</surname>
<given-names>Louise</given-names>
</name>
<name><surname>del Ojo Elias</surname>
<given-names>Carlos</given-names>
</name>
<name><surname>Loose</surname>
<given-names>Matthew</given-names>
</name>
<name><surname>Nilgiriwala</surname>
<given-names>Kayzad</given-names>
</name>
<name><surname>Chatterjee</surname>
<given-names>Anirvan</given-names>
</name>
<name><surname>Smith</surname>
<given-names>E. Grace</given-names>
</name>
<name><surname>Sanderson</surname>
<given-names>Nicolas</given-names>
</name>
<name><surname>Walker</surname>
<given-names>Timothy M.</given-names>
</name>
<name><surname>Morgan</surname>
<given-names>Marcus R.</given-names>
</name>
<name><surname>Wyllie</surname>
<given-names>David H.</given-names>
</name>
<name><surname>Walker</surname>
<given-names>A. Sarah</given-names>
</name>
<name><surname>Peto</surname>
<given-names>Tim E. A.</given-names>
</name>
<name><surname>Crook</surname>
<given-names>Derrick W.</given-names>
</name>
<name><surname>Iqbal</surname>
<given-names>Zamin</given-names>
</name>
</person-group>
<article-title>Same-Day Diagnostic and Surveillance Data for Tuberculosis via Whole-Genome Sequencing of Direct Respiratory Samples</article-title>
<source>Journal of Clinical Microbiology</source>
<year>2017</year>
<volume>55</volume>
<issue>5</issue>
<fpage>1285</fpage>
<lpage>1298</lpage>
<pub-id pub-id-type="doi">10.1128/JCM.02483-16</pub-id>
<pub-id pub-id-type="pmid">28275074</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000276 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000276 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:6192184
   |texte=   Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:30332990" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

MERS exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
