The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes</title>
<author><name sortKey="Lees, John A" sort="Lees, John A" uniqKey="Lees J" first="John A." last="Lees">John A. Lees</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Vehkala, Minna" sort="Vehkala, Minna" uniqKey="Vehkala M" first="Minna" last="Vehkala">Minna Vehkala</name>
<affiliation><nlm:aff id="a2"><institution>Department of Mathematics and Statistics, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="V Lim Ki, Niko" sort="V Lim Ki, Niko" uniqKey="V Lim Ki N" first="Niko" last="V Lim Ki">Niko Välimäki</name>
<affiliation><nlm:aff id="a3"><institution>Department of Medical and Clinical Genetics, Genome-Scale Biology Research Program, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Harris, Simon R" sort="Harris, Simon R" uniqKey="Harris S" first="Simon R." last="Harris">Simon R. Harris</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chewapreecha, Claire" sort="Chewapreecha, Claire" uniqKey="Chewapreecha C" first="Claire" last="Chewapreecha">Claire Chewapreecha</name>
<affiliation><nlm:aff id="a4"><institution>Department of Medicine, University of Cambridge</institution>
, Cambridge CB2 0SP,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Croucher, Nicholas J" sort="Croucher, Nicholas J" uniqKey="Croucher N" first="Nicholas J." last="Croucher">Nicholas J. Croucher</name>
<affiliation><nlm:aff id="a5"><institution>Department of Infectious Disease Epidemiology, Imperial College</institution>
, London W2 1NY,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Marttinen, Pekka" sort="Marttinen, Pekka" uniqKey="Marttinen P" first="Pekka" last="Marttinen">Pekka Marttinen</name>
<affiliation><nlm:aff id="a6"><institution>Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a7"><institution>Helsinki Institute of Information Technology HIIT, Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Davies, Mark R" sort="Davies, Mark R" uniqKey="Davies M" first="Mark R." last="Davies">Mark R. Davies</name>
<affiliation><nlm:aff id="a8"><institution>Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne</institution>
, Melbourne, Victoria 3010,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Steer, Andrew C" sort="Steer, Andrew C" uniqKey="Steer A" first="Andrew C." last="Steer">Andrew C. Steer</name>
<affiliation><nlm:aff id="a9"><institution>Centre for International Child Health, Department of Paediatrics, University of Melbourne</institution>
, Melbourne, Victoria 3052,<country>Australia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a10"><institution>Group A Streptococcal Research Group, Murdoch Children's Research Institute</institution>
, Parkville, Victoria 3052,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tong, Steven Y C" sort="Tong, Steven Y C" uniqKey="Tong S" first="Steven Y. C." last="Tong">Steven Y. C. Tong</name>
<affiliation><nlm:aff id="a11"><institution>Menzies School of Health Research</institution>
, Darwin, Northern Territory 0811,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Honkela, Antti" sort="Honkela, Antti" uniqKey="Honkela A" first="Antti" last="Honkela">Antti Honkela</name>
<affiliation><nlm:aff id="a12"><institution>Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Parkhill, Julian" sort="Parkhill, Julian" uniqKey="Parkhill J" first="Julian" last="Parkhill">Julian Parkhill</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bentley, Stephen D" sort="Bentley, Stephen D" uniqKey="Bentley S" first="Stephen D." last="Bentley">Stephen D. Bentley</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Department of Mathematics and Statistics, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a13"><institution>Department of Biostatistics, University of Oslo</institution>
, 0317 Oslo,<country>Norway</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">27633831</idno>
<idno type="pmc">5028413</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5028413</idno>
<idno type="RBID">PMC:5028413</idno>
<idno type="doi">10.1038/ncomms12797</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000142</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000142</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes</title>
<author><name sortKey="Lees, John A" sort="Lees, John A" uniqKey="Lees J" first="John A." last="Lees">John A. Lees</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Vehkala, Minna" sort="Vehkala, Minna" uniqKey="Vehkala M" first="Minna" last="Vehkala">Minna Vehkala</name>
<affiliation><nlm:aff id="a2"><institution>Department of Mathematics and Statistics, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="V Lim Ki, Niko" sort="V Lim Ki, Niko" uniqKey="V Lim Ki N" first="Niko" last="V Lim Ki">Niko Välimäki</name>
<affiliation><nlm:aff id="a3"><institution>Department of Medical and Clinical Genetics, Genome-Scale Biology Research Program, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Harris, Simon R" sort="Harris, Simon R" uniqKey="Harris S" first="Simon R." last="Harris">Simon R. Harris</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chewapreecha, Claire" sort="Chewapreecha, Claire" uniqKey="Chewapreecha C" first="Claire" last="Chewapreecha">Claire Chewapreecha</name>
<affiliation><nlm:aff id="a4"><institution>Department of Medicine, University of Cambridge</institution>
, Cambridge CB2 0SP,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Croucher, Nicholas J" sort="Croucher, Nicholas J" uniqKey="Croucher N" first="Nicholas J." last="Croucher">Nicholas J. Croucher</name>
<affiliation><nlm:aff id="a5"><institution>Department of Infectious Disease Epidemiology, Imperial College</institution>
, London W2 1NY,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Marttinen, Pekka" sort="Marttinen, Pekka" uniqKey="Marttinen P" first="Pekka" last="Marttinen">Pekka Marttinen</name>
<affiliation><nlm:aff id="a6"><institution>Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a7"><institution>Helsinki Institute of Information Technology HIIT, Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Davies, Mark R" sort="Davies, Mark R" uniqKey="Davies M" first="Mark R." last="Davies">Mark R. Davies</name>
<affiliation><nlm:aff id="a8"><institution>Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne</institution>
, Melbourne, Victoria 3010,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Steer, Andrew C" sort="Steer, Andrew C" uniqKey="Steer A" first="Andrew C." last="Steer">Andrew C. Steer</name>
<affiliation><nlm:aff id="a9"><institution>Centre for International Child Health, Department of Paediatrics, University of Melbourne</institution>
, Melbourne, Victoria 3052,<country>Australia</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a10"><institution>Group A Streptococcal Research Group, Murdoch Children's Research Institute</institution>
, Parkville, Victoria 3052,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Tong, Steven Y C" sort="Tong, Steven Y C" uniqKey="Tong S" first="Steven Y. C." last="Tong">Steven Y. C. Tong</name>
<affiliation><nlm:aff id="a11"><institution>Menzies School of Health Research</institution>
, Darwin, Northern Territory 0811,<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Honkela, Antti" sort="Honkela, Antti" uniqKey="Honkela A" first="Antti" last="Honkela">Antti Honkela</name>
<affiliation><nlm:aff id="a12"><institution>Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Parkhill, Julian" sort="Parkhill, Julian" uniqKey="Parkhill J" first="Julian" last="Parkhill">Julian Parkhill</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bentley, Stephen D" sort="Bentley, Stephen D" uniqKey="Bentley S" first="Stephen D." last="Bentley">Stephen D. Bentley</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation><nlm:aff id="a1"><institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a2"><institution>Department of Mathematics and Statistics, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="a13"><institution>Department of Biostatistics, University of Oslo</institution>
, 0317 Oslo,<country>Norway</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Nature Communications</title>
<idno type="eISSN">2041-1723</idno>
<imprint><date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens <italic>Streptococcus pneumoniae</italic>
 and <italic>Streptococcus pyogenes</italic>
, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of <italic>S. pyogenes</italic>
. We thus demonstrate that our method can answer important biologically and medically relevant questions.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Falush, D" uniqKey="Falush D">D. Falush</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chen, P E" uniqKey="Chen P">P. E. Chen</name>
</author>
<author><name sortKey="Shapiro, B J" uniqKey="Shapiro B">B. J. Shapiro</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Farhat, M R" uniqKey="Farhat M">M. R. Farhat</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Liu, J Z" uniqKey="Liu J">J. Z. Liu</name>
</author>
<author><name sortKey="Anderson, C A" uniqKey="Anderson C">C. A. Anderson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sheppard, S K" uniqKey="Sheppard S">S. K. Sheppard</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chewapreecha, C" uniqKey="Chewapreecha C">C. Chewapreecha</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Laabei, M" uniqKey="Laabei M">M. Laabei</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Weinert, L" uniqKey="Weinert L">L. Weinert</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zerbino, D R" uniqKey="Zerbino D">D. R. Zerbino</name>
</author>
<author><name sortKey="Birney, E" uniqKey="Birney E">E. Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gardner, S N" uniqKey="Gardner S">S. N. Gardner</name>
</author>
<author><name sortKey="Hall, B G" uniqKey="Hall B">B. G. Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ondov, B D" uniqKey="Ondov B">B. D. Ondov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Evangelou, E" uniqKey="Evangelou E">E. Evangelou</name>
</author>
<author><name sortKey="Ioannidis, J P A" uniqKey="Ioannidis J">J. P. A. Ioannidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chewapreecha, C" uniqKey="Chewapreecha C">C. Chewapreecha</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rizk, G" uniqKey="Rizk G">G. Rizk</name>
</author>
<author><name sortKey="Lavenier, D" uniqKey="Lavenier D">D. Lavenier</name>
</author>
<author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Spain, S L" uniqKey="Spain S">S. L. Spain</name>
</author>
<author><name sortKey="Barrett, J C" uniqKey="Barrett J">J. C. Barrett</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Croucher, N J" uniqKey="Croucher N">N. J. Croucher</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Croucher, N J" uniqKey="Croucher N">N. J. Croucher</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Maskell, J P" uniqKey="Maskell J">J. P. Maskell</name>
</author>
<author><name sortKey="Sefton, A M" uniqKey="Sefton A">A. M. Sefton</name>
</author>
<author><name sortKey="Hall, L M" uniqKey="Hall L">L. M. Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ng, P C" uniqKey="Ng P">P. C. Ng</name>
</author>
<author><name sortKey="Henikoff, S" uniqKey="Henikoff S">S. Henikoff</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Steer, A C" uniqKey="Steer A">A. C. Steer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Seale, A C" uniqKey="Seale A">A. C. Seale</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Roberts, A P" uniqKey="Roberts A">A. P. Roberts</name>
</author>
<author><name sortKey="Mullany, P" uniqKey="Mullany P">P. Mullany</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dubnau, D" uniqKey="Dubnau D">D. Dubnau</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lefebure, T" uniqKey="Lefebure T">T. Lefébure</name>
</author>
<author><name sortKey="Stanhope, M J" uniqKey="Stanhope M">M. J. Stanhope</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Raeder, R" uniqKey="Raeder R">R. Raeder</name>
</author>
<author><name sortKey="Boyle, M D" uniqKey="Boyle M">M. D. Boyle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Raeder, R" uniqKey="Raeder R">R. Raeder</name>
</author>
<author><name sortKey="Boyle, M D" uniqKey="Boyle M">M. D. Boyle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Smith, T C" uniqKey="Smith T">T. C. Smith</name>
</author>
<author><name sortKey="Sledjeski, D D" uniqKey="Sledjeski D">D. D. Sledjeski</name>
</author>
<author><name sortKey="Boyle, M D P" uniqKey="Boyle M">M. D. P. Boyle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Smith, T C" uniqKey="Smith T">T. C. Smith</name>
</author>
<author><name sortKey="Sledjeski, D D" uniqKey="Sledjeski D">D. D. Sledjeski</name>
</author>
<author><name sortKey="Boyle, M D P" uniqKey="Boyle M">M. D. P. Boyle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Walker, M J" uniqKey="Walker M">M. J. Walker</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N. Välimäki</name>
</author>
<author><name sortKey="Puglisi, S" uniqKey="Puglisi S">S. Puglisi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Seth, S" uniqKey="Seth S">S. Seth</name>
</author>
<author><name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N. Välimäki</name>
</author>
<author><name sortKey="Kaski, S" uniqKey="Kaski S">S. Kaski</name>
</author>
<author><name sortKey="Honkela, A" uniqKey="Honkela A">A. Honkela</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gog, S" uniqKey="Gog S">S. Gog</name>
</author>
<author><name sortKey="Beller, T" uniqKey="Beller T">T. Beller</name>
</author>
<author><name sortKey="Moffat, A" uniqKey="Moffat A">A. Moffat</name>
</author>
<author><name sortKey="Petri, M" uniqKey="Petri M">M. Petri</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Price, A L" uniqKey="Price A">A. L. Price</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chengsong, Z" uniqKey="Chengsong Z">Z. Chengsong</name>
</author>
<author><name sortKey="Jianming, Y" uniqKey="Jianming Y">Y. Jianming</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tasoulis, S" uniqKey="Tasoulis S">S. Tasoulis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cheng, L" uniqKey="Cheng L">L. Cheng</name>
</author>
<author><name sortKey="Connor, T R" uniqKey="Connor T">T. R. Connor</name>
</author>
<author><name sortKey="Siren, J" uniqKey="Siren J">J. Sirén</name>
</author>
<author><name sortKey="Aanensen, D M" uniqKey="Aanensen D">D. M. Aanensen</name>
</author>
<author><name sortKey="Corander, J" uniqKey="Corander J">J. Corander</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Heinze, G" uniqKey="Heinze G">G. Heinze</name>
</author>
<author><name sortKey="Schemper, M" uniqKey="Schemper M">M. Schemper</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ford, C B" uniqKey="Ford C">C. B. Ford</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sanderson, C" uniqKey="Sanderson C">C. Sanderson</name>
</author>
<author><name sortKey="Curtin, R" uniqKey="Curtin R">R. Curtin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="King, D E" uniqKey="King D">D. E. King</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kent, W J" uniqKey="Kent W">W. J. Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cingolani, P" uniqKey="Cingolani P">P. Cingolani</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dalquen, D A" uniqKey="Dalquen D">D. A. Dalquen</name>
</author>
<author><name sortKey="Anisimova, M" uniqKey="Anisimova M">M. Anisimova</name>
</author>
<author><name sortKey="Gonnet, G H" uniqKey="Gonnet G">G. H. Gonnet</name>
</author>
<author><name sortKey="Dessimoz, C" uniqKey="Dessimoz C">C. Dessimoz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chen, J Q" uniqKey="Chen J">J. Q. Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hu, X" uniqKey="Hu X">X. Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cartwright, R A" uniqKey="Cartwright R">R. A. Cartwright</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kosiol, C" uniqKey="Kosiol C">C. Kosiol</name>
</author>
<author><name sortKey="Holmes, I" uniqKey="Holmes I">I. Holmes</name>
</author>
<author><name sortKey="Goldman, N" uniqKey="Goldman N">N. Goldman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Newman, S C" uniqKey="Newman S">S. C. Newman</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">Nat Commun</journal-id>
<journal-id journal-id-type="iso-abbrev">Nat Commun</journal-id>
<journal-title-group><journal-title>Nature Communications</journal-title>
</journal-title-group>
<issn pub-type="epub">2041-1723</issn>
<publisher><publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">27633831</article-id>
<article-id pub-id-type="pmc">5028413</article-id>
<article-id pub-id-type="pii">ncomms12797</article-id>
<article-id pub-id-type="doi">10.1038/ncomms12797</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Lees</surname>
<given-names>John A.</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="author-notes" rid="n1">*</xref>
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0001-5360-1254</contrib-id>
</contrib>
<contrib contrib-type="author"><name><surname>Vehkala</surname>
<given-names>Minna</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="author-notes" rid="n1">*</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Välimäki</surname>
<given-names>Niko</given-names>
</name>
<xref ref-type="aff" rid="a3">3</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Harris</surname>
<given-names>Simon R.</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Chewapreecha</surname>
<given-names>Claire</given-names>
</name>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Croucher</surname>
<given-names>Nicholas J.</given-names>
</name>
<xref ref-type="aff" rid="a5">5</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Marttinen</surname>
<given-names>Pekka</given-names>
</name>
<xref ref-type="aff" rid="a6">6</xref>
<xref ref-type="aff" rid="a7">7</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Davies</surname>
<given-names>Mark R.</given-names>
</name>
<xref ref-type="aff" rid="a8">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Steer</surname>
<given-names>Andrew C.</given-names>
</name>
<xref ref-type="aff" rid="a9">9</xref>
<xref ref-type="aff" rid="a10">10</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Tong</surname>
<given-names>Steven Y. C.</given-names>
</name>
<xref ref-type="aff" rid="a11">11</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Honkela</surname>
<given-names>Antti</given-names>
</name>
<xref ref-type="aff" rid="a12">12</xref>
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0001-9193-8093</contrib-id>
</contrib>
<contrib contrib-type="author"><name><surname>Parkhill</surname>
<given-names>Julian</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-7069-5958</contrib-id>
</contrib>
<contrib contrib-type="author"><name><surname>Bentley</surname>
<given-names>Stephen D.</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Corander</surname>
<given-names>Jukka</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a13">13</xref>
</contrib>
<aff id="a1"><label>1</label>
<institution>Pathogen Genomics, Wellcome Trust Sanger Institute</institution>
, Cambridge CB10 1SA,<country>UK</country>
</aff>
<aff id="a2"><label>2</label>
<institution>Department of Mathematics and Statistics, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</aff>
<aff id="a3"><label>3</label>
<institution>Department of Medical and Clinical Genetics, Genome-Scale Biology Research Program, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</aff>
<aff id="a4"><label>4</label>
<institution>Department of Medicine, University of Cambridge</institution>
, Cambridge CB2 0SP,<country>UK</country>
</aff>
<aff id="a5"><label>5</label>
<institution>Department of Infectious Disease Epidemiology, Imperial College</institution>
, London W2 1NY,<country>UK</country>
</aff>
<aff id="a6"><label>6</label>
<institution>Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</aff>
<aff id="a7"><label>7</label>
<institution>Helsinki Institute of Information Technology HIIT, Department of Computer Science, Aalto University</institution>
, Espoo FI-00076,<country>Finland</country>
</aff>
<aff id="a8"><label>8</label>
<institution>Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne</institution>
, Melbourne, Victoria 3010,<country>Australia</country>
</aff>
<aff id="a9"><label>9</label>
<institution>Centre for International Child Health, Department of Paediatrics, University of Melbourne</institution>
, Melbourne, Victoria 3052,<country>Australia</country>
</aff>
<aff id="a10"><label>10</label>
<institution>Group A Streptococcal Research Group, Murdoch Children's Research Institute</institution>
, Parkville, Victoria 3052,<country>Australia</country>
</aff>
<aff id="a11"><label>11</label>
<institution>Menzies School of Health Research</institution>
, Darwin, Northern Territory 0811,<country>Australia</country>
</aff>
<aff id="a12"><label>12</label>
<institution>Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki</institution>
, Helsinki FI-00014,<country>Finland</country>
</aff>
<aff id="a13"><label>13</label>
<institution>Department of Biostatistics, University of Oslo</institution>
, 0317 Oslo,<country>Norway</country>
</aff>
</contrib-group>
<author-notes><corresp id="c1"><label>a</label>
<email>jukka.corander@helsinki.fi</email>
</corresp>
<fn id="n1"><label>*</label>
<p>These authors contributed equally to this work.</p>
</fn>
</author-notes>
<pub-date pub-type="epub"><day>16</day>
<month>09</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection"><year>2016</year>
</pub-date>
<volume>7</volume>
<elocation-id>12797</elocation-id>
<history><date date-type="received"><day>05</day>
<month>01</month>
<year>2016</year>
</date>
<date date-type="accepted"><day>28</day>
<month>07</month>
<year>2016</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2016, The Author(s)</copyright-statement>
<copyright-year>2016</copyright-year>
<copyright-holder>The Author(s)</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/"><pmc-comment>author-paid</pmc-comment>
          <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract><p>Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens <italic>Streptococcus pneumoniae</italic>
 and <italic>Streptococcus pyogenes</italic>
, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of <italic>S. pyogenes</italic>
. We thus demonstrate that our method can answer important biologically and medically relevant questions.</p>
</abstract>
<abstract abstract-type="web-summary"><p><inline-graphic id="i1" xlink:href="ncomms12797-i1.jpg"></inline-graphic>
Plasticity and clonal population structure in bacterial genomes can hinder traditional SNP-based genetic association studies. Here, Corander and colleagues present a method to identify variable-length sequence elements enriched in a phenotype of interest, and demonstrate its use in human pathogens.</p>
</abstract>
</article-meta>
</front>
<body><p>The rapidly expanding repositories of genomic data for bacteria hold an enormous and yet largely untapped potential for building a more detailed understanding of the evolutionary responses to changing environmental conditions, such as the widespread use of antibiotics and switches between host niches as farming practices change.</p>
<p>Studies attempting to determine the genetic basis of bacterial traits have traditionally been limited to identifying emerging clones, which are associated with the phenotype of interest, rather than identifying the specific causal genetic elements<xref ref-type="bibr" rid="b1">1</xref>
. This is partly due to the fact that bacteria reproduce clonally, meaning that a large proportion of the genome is in linkage disequilibrium (LD) with any given trait<xref ref-type="bibr" rid="b2">2</xref>
. For any method to determine which of the many variants associated with a trait is truly causal, the trait must not be uniquely associated with a single clonal lineage. The high recombination rates observed in some species can also break up these large LD blocks, boosting the potential power of an association study to discover the causal variant(s).</p>
<p>For strongly selected traits caused by highly penetrant variants, such as antimicrobial resistance, scanning for homoplasy (convergent evolution) determined by ancestral state reconstruction has been shown to be successful at identifying the causal variant<xref ref-type="bibr" rid="b3">3</xref>
. However, finding variants which are not fully penetrant for a phenotype (as may be the case for clinically relevant traits such as virulence) requires large numbers of samples<xref ref-type="bibr" rid="b4">4</xref>
 and a more general test of association.</p>
<p>For these reasons, genome-wide association studies (GWAS) for bacterial phenotypes have only recently started to appear<xref ref-type="bibr" rid="b2">2</xref>
<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
<xref ref-type="bibr" rid="b8">8</xref>
. Standard GWAS methods originally developed for human single-nucleotide polymorphism (SNP) data have been shown to be applicable to core genome mutations in bacteria<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
. However, given the high level of genome plasticity of many known bacterial species, we can anticipate that such methods will only partially identify genetic determinants of phenotypic variation. To enable discovery of mechanisms related, for instance, to gene content, alternative alignment-free methods have also been introduced<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b8">8</xref>
. These methods use k-mers, that is, DNA words of length k, as generalized alternatives to SNPs as putative explanations for observed differences in phenotype distributions. The main advantage of k-mers is their ability to capture several different types of variation present across a collection of genomes, including mutations, indels, recombinations, variable promoter architecture and differences in gene content, and to capture these variations in regions not present in all genomes.</p>
<p>K-mers have been used in bacterial genomics for sequence assembly<xref ref-type="bibr" rid="b9">9</xref>
, SNP calling<xref ref-type="bibr" rid="b10">10</xref>
 and distance estimation<xref ref-type="bibr" rid="b11">11</xref>
. Previous GWAS studies using k-mers to overcome limitations of SNP-based association have used Monte-Carlo simulations of word gain and loss along an inferred phylogeny to control for population structure<xref ref-type="bibr" rid="b5">5</xref>
, whereas SNP-based studies have used clustering algorithms on a core alignment and stratified association tests on the resulting groups of samples<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
. The former does not scale computationally to the hundreds of isolates required to find lower effect-size associations, and the latter requires a core alignment, which lacks sensitivity and is difficult to produce when samples are numerous or particularly diverse.</p>
<p>Here we present sequence element enrichment analysis (SEER), a method computationally scalable to tens of thousands of genomes, implemented as a stand-alone pipeline that uses either <italic>de novo</italic>
 assembled contigs or raw read data as input. We apply SEER to both simulated and real data from large and diverse populations, and show that it can accurately detect associations with antibiotic resistance caused both by presence of a gene and by SNPs in coding regions, as well as discover novel invasiveness factors.</p>
<sec disp-level="1"><title>Results</title>
<sec disp-level="2"><title>Implementation</title>
<p>SEER implements and combines three key insights, which we discuss in detail in the methods section: an efficient scan of all possible k-mers with a distributed string mining algorithm, an appropriate alignment-free correction for clonal population structure, and a fast and fully robust association analysis of all counted k-mers.</p>
<p>K-mers allow simultaneous discovery of both short genetic variants and entire genes associated with a phenotype. Longer k-mers provide higher specificity but lower sensitivity than shorter k-mers. Rather than arbitrarily selecting a length before analysis, or having to count k-mers at multiple lengths and combine the results, we provide an efficient implementation that counts and tests all k-mers over 9 bases long simultaneously.</p>
<p>An association test, using an appropriate correction for the clonal population structure, is performed on the counted k-mers. Those reaching significance are filtered post-association and mapped onto both a well-annotated reference sequence and the annotated draft assemblies to allow discovery of variation in accessory genes not present in the reference strain. The significant k-mers themselves can also be assembled into a longer consensus sequence. Annotating variants by predicted function and effect (against a reference sequence) in the resulting k-mers allows fine-mapping of SNPs and small indels.</p>
<p>Meta-analysis of association studies increases sample size, which improves power and reduces false-positive rates<xref ref-type="bibr" rid="b12">12</xref>
. To facilitate meta-analysis of k-mers across studies, the output of SEER includes effect size, direction and standard error, which can be used directly with existing software to meta-analyse all overlapping k-mers.</p>
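<p>As an illustration of how per-study effect sizes and standard errors can be combined, the sketch below pools one k-mer reported by two studies using standard fixed-effect inverse-variance weighting. This is a generic textbook approach and not necessarily what any particular meta-analysis software does; the function name and the input values are hypothetical.</p>
<preformat>
```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted (fixed-effect) pooling of per-study effects."""
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


# Hypothetical log odds ratios and standard errors for the same k-mer
# reported by two independent studies.
pooled_beta, pooled_se, pooled_p = fixed_effect_meta([0.9, 1.1], [0.30, 0.40])
```
</preformat>
<p>Because both hypothetical studies agree on the direction of effect, the pooled standard error is smaller than either input, which is the gain in power that meta-analysis is intended to provide.</p>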
<p>SEER is implemented in C++, and available at <ext-link ext-link-type="uri" xlink:href="https://github.com/johnlees/seer">https://github.com/johnlees/seer</ext-link>
 as source code, a precompiled binary, and a self-contained virtual machine.</p>
</sec>
<sec disp-level="2"><title>Application to simulated data</title>
<p>To test the power of SEER across different sample sizes, we simulated 3,069 <italic>Streptococcus pneumoniae</italic>
 genomes from the phylogeny observed in a Thai refugee camp<xref ref-type="bibr" rid="b13">13</xref>
 using parameters estimated from real data including accumulation of SNPs, indels (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 1</xref>
), gene loss and recombination events. Using knowledge of the true alignments, we then artificially associated an accessory gene with a phenotype over a range of odds ratios and evaluated power at different sample sizes (<xref ref-type="fig" rid="f1">Fig. 1a</xref>
). The expected pattern for this power calculation is seen, with higher odds ratio effects being easier to detect. Currently detected associations in bacteria have had large effect sizes (OR>28 host-specificity<xref ref-type="bibr" rid="b5">5</xref>
, OR>3 beta-lactam resistance<xref ref-type="bibr" rid="b6">6</xref>
), and the required sample sizes predicted here are consistent with these discoveries.</p>
<p>The large k-mer diversity, along with the population stratification of gene loss, makes the simulated estimate of the sample size required to reach the stated power clearly conservative. Convergent evolution along multiple branches of a phylogeny for a real population reacting to selection pressures will reduce the required sample size<xref ref-type="bibr" rid="b3">3</xref>
.</p>
<p>We also used k-mers counted at constant lengths by DSK<xref ref-type="bibr" rid="b14">14</xref>
 to perform the gene presence/absence association (<xref ref-type="fig" rid="f1">Fig. 1b</xref>
). Counting all informative k-mers (see Methods) rather than a range of predefined k-mer lengths gives greater power to detect associations, with 80% power being reached at ∼1,500 samples, compared with 2,000 samples required by the predefined lengths. The slightly lower power at low sample numbers is due to a stricter Bonferroni adjustment being applied to the larger number of DSM k-mers over the DSK k-mers. This is exactly the expected advantage from including shorter k-mers to increase sensitivity, but as k-mers are correlated with each other due to evolving along the same phylogeny, using the same Bonferroni correction for multiple testing does not decrease specificity.</p>
<p>The strong LD caused by the clonal reproduction of bacterial populations means that non-causal k-mers may also appear to be associated. This is well-documented in human genetics; non-causal variants tag the causal variant increasing discovery power, but make it more difficult to fine-map the true link between genotype and phenotype<xref ref-type="bibr" rid="b15">15</xref>
. In simulations it is difficult to replicate the LD patterns observed in real populations, as recombination maps for specific bacterial lineages are not yet known. To evaluate fine-mapping power of a SNP we instead used the real sequence data and simulated phenotypes based on changing the effect size of a known causal variant and evaluating the physical distance of significant k-mers from the variant site.</p>
<p>Using DSM we counted 68M k-mers which we then tested for association. The 2,639 significant k-mers were mapped to a reference genome, and were found to cover most of the genome with a peak at the causal variant (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 2</xref>
). Mapped k-mers were then placed into three categories: if they contained the causal variant I100L (10 k-mers), were within the same gene (74 k-mers), or within 2.5 kb in either direction (207 k-mers). <xref ref-type="fig" rid="f1">Figure 1c</xref>
 shows the resulting power when random subsamples of the population are taken. As expected, power is higher when not specifying that the causal variant must be hit, as there are many more k-mers which are in LD with the SNP than directly overlapping it, thus increasing sensitivity.</p>
</sec>
<sec disp-level="2"><title>Confirmation of known resistance mechanisms in <italic>S. pneumoniae</italic>
</title>
<p>SEER was applied to the sequenced genomes from the study described above<xref ref-type="bibr" rid="b6">6</xref>
, using measured resistance to five different antibiotics as the phenotype: chloramphenicol, erythromycin, β-lactams, tetracycline and trimethoprim. Chloramphenicol resistance is conferred by the <italic>cat</italic>
 gene, and tetracycline resistance is conferred by the <italic>tetM</italic>
 gene, both carried on the integrative conjugative element (ICE) ICE<italic>Sp</italic>
23FST81 in the <italic>S. pneumoniae</italic>
 ATCC 700669 chromosome<xref ref-type="bibr" rid="b16">16</xref>
. For both of these drug-resistance phenotypes the ICE contains 99% of the significant k-mers, and the causal genes rank highly within the clusters (<xref ref-type="table" rid="t1">Table 1</xref>
, <xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 3</xref>
).</p>
<p>Resistance to erythromycin is also conferred by presence of a gene, but there are multiple genes that can be causal for this resistance: <italic>ermB</italic>
 causes resistance by methylating rRNA, whereas <italic>mef</italic>
/<italic>mel</italic>
 is an efflux pump system<xref ref-type="bibr" rid="b17">17</xref>
. In the population studied, this phenotype was strongly associated with two large lineages (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 4</xref>
), making the task of disentangling association with a lineage versus a specific locus more difficult. Significant k-mers are found in the mega and omega cassettes, which carry the <italic>mel</italic>
/<italic>mef</italic>
 and <italic>ermB</italic>
 resistance elements, respectively.</p>
<p>Hits are also found to other sites within the ICE, a permease directly upstream of <italic>folP</italic>
, <italic>prfC</italic>
 and <italic>gatA</italic>
. Macrolide resistance cassettes frequently insert into the ICE in <italic>S. pneumoniae</italic>
, so it is in LD with the genes discussed above. In sulphamethoxazole resistance, <italic>folP</italic>
 is modified by small insertions, with which the adjacent permease is in LD. Finally, <italic>prfC</italic>
 and <italic>gatA</italic>
 are both involved in translation, so could conceivably contain compensatory mutations when <italic>ermB</italic>
-mediated resistance is present. Further evidence of these compensatory mutations would be required to rule out the k-mers mapping to them simply being false positives driven by population structure.</p>
<p>Some k-mers do not map to the reference, as they arise from lineage-specific associations with genetic elements not found in the reference strain. This highlights both the need to map to a close reference or draft assembly to interpret hits and the importance of functional follow-up to validate potential hits from SEER.</p>
<p>Multiple mechanisms of resistance to β-lactams are possible<xref ref-type="bibr" rid="b6">6</xref>
. Here, we consider just the most important (that is, highest effect size) mutations, which are SNPs in the penicillin binding proteins <italic>pbp2x</italic>
, <italic>pbp2b</italic>
 and <italic>pbp1a</italic>
. In this case, looking at the highest-coverage annotations finds these genes but is not sufficient, because so many k-mers are significant, whether due to other mechanisms of resistance, physical linkage with causal variants or co-selection for resistance-conferring mutations. Instead, selecting the k-mers with the most significant <italic>P</italic>
 values gives the top four hit loci as <italic>pbp2b</italic>
 (<italic>P</italic>
=10<sup>−132</sup>
), <italic>pbp2x</italic>
 (<italic>P</italic>
=10<sup>−96</sup>
), putative RNA pseudouridylate synthase UniParc B8ZPU5 (<italic>P</italic>
=10<sup>−92</sup>
) and <italic>pbp1a</italic>
 (<italic>P</italic>
=10<sup>−89</sup>
). The non-<italic>pbp</italic>
 hit is a homologue of a gene in linkage disequilibrium with <italic>pbp2b</italic>
, which would suggest mismapping rather than causation of resistance.</p>
<p>Trimethoprim resistance in <italic>S. pneumoniae</italic>
 is conferred by the SNP I100L in the <italic>folA/dyr</italic>
 gene<xref ref-type="bibr" rid="b18">18</xref>
. The <italic>dpr</italic>
 and <italic>dyr</italic>
 genes, which are adjacent in the genome, have the highest coverage of significant k-mers (<xref ref-type="fig" rid="f2">Fig. 2</xref>
, <xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 2</xref>
). Following our fine-mapping procedure, we call four high-confidence SNPs that are predicted to be more likely to affect protein function than synonymous SNPs. One is the causal SNP, and the others appear to be hitchhikers in LD with I100L. By evaluating whether sites are conserved across the protein family<xref ref-type="bibr" rid="b19">19</xref>
, the known causal SNP is ranked as the highest variant, showing that in this case fine-mapping is possible using the output from SEER.</p>
<p>We then compared the results from SEER with the results from two existing methods (see Methods). The first method (implemented using plink) uses mapping of SNPs against a reference, followed by applying the Cochran–Mantel–Haenszel test at every variable site<xref ref-type="bibr" rid="b6">6</xref>
. The second uses DSK<xref ref-type="bibr" rid="b14">14</xref>
 to count k-mers of length 31, and a highly robust correction for population structure which scales to around 100 genomes<xref ref-type="bibr" rid="b5">5</xref>
.</p>
<p>Both SEER and association by core mapping of SNPs (using plink) identified resistances caused by presence of a gene, when it is present in the reference used for mapping (<xref ref-type="supplementary-material" rid="S1">Supplementary Table 1</xref>
). Both produce their most significant <italic>P</italic>
 values in the causal element, though SEER appears to have a lower false-positive rate. However, as demonstrated by chloramphenicol resistance, if not enough SNP calls are made in the causal gene this hinders fine-mapping. SNP-mediated resistance showed the same pattern since many other SNPs were ranked above the causal variant. In the case of β-lactam resistance both methods seem to perform equally well, likely due to the higher rate of recombination and the creation of mosaic <italic>pbp</italic>
 genes.</p>
<p>In addition, as for erythromycin resistance, when an element is not present in the reference it is not detectable in SNP-based association analysis. In such cases, multiple mappings against other reference genomes would have to be made, which is a tedious and computationally costly procedure.</p>
<p>Since the k-mer results from SEER are reference-free, the computational cost of mapping reads to different reference genomes is minimized as only the significant k-mers are mapped to all available references. Alternatively, the significant k-mers can be mapped to all draft assemblies in the study, at least one of which is guaranteed to contain the k-mer, to check whether any annotations are overlapped.</p>
<p>The small sample, combined with the fixed-length 31-mer approach (see Methods), did not reach significance for chloramphenicol, tetracycline or trimethoprim, as the effect size of any k-mer is too small to be detected in the number of samples accessible by the method. Erythromycin had 19,307 hits, and β-lactams 419 hits, at between 1 and 2% minor allele frequency (MAF), which are all false positives that would likely have been excluded by a fully robust population structure correction method.</p>
</sec>
<sec disp-level="2"><title>Discovery of k-mers associated with <italic>S. pyogenes</italic>
 invasiveness</title>
<p>Most bacterial GWAS studies to date have searched for genotypic variants that contribute towards or completely explain antibiotic resistance phenotypes. As a proof of principle that SEER can be used for the discovery stage of sequence elements associated with other clinically important phenotypes, we applied our tool to 675 <italic>Streptococcus pyogenes</italic>
 (group A <italic>Streptococcus</italic>
) genomes obtained from population diversity studies for genetic signatures of invasive propensity.</p>
<p>We sequenced 347 isolates of <italic>S. pyogenes</italic>
 collected from Fiji<xref ref-type="bibr" rid="b20">20</xref>
 on the Illumina HiSeq platform, and combined this with 328 existing sequences from Kilifi, Kenya<xref ref-type="bibr" rid="b21">21</xref>
. We defined those isolated from blood, cerebrospinal fluid (CSF) or bronchopulmonary aspirate as invasive (<italic>n</italic>
=185), and those isolated from throat, skin or urine as non-invasive (<italic>n</italic>
=490). We ran SEER to determine k-mers significantly associated with invasion, followed by a BLAST search of the k-mers against the nr/nt database to determine a suitable reference for mapping purposes. After mapping to this reference, SNPs were called (see Methods).</p>
<p>After this preliminary analysis, the top hit was the <italic>tetM</italic>
 gene from a conjugative transposon (Tn<italic>916</italic>
) carried by 23% of isolates (<xref ref-type="supplementary-material" rid="S1">Supplementary Figs 5 and 6</xref>
). These elements are known to be variably present in the chromosome of <italic>S. pyogenes</italic>
<xref ref-type="bibr" rid="b22">22</xref>
, and the lack of co-segregation with population structure explains our power to discover the association. However, as a different proportion of the isolates from each collection were invasive (Fiji—13%; Kilifi—43%), the significant k-mers will also include elements specific to the Kilifi data set. Indeed, we found that this version of Tn<italic>916</italic>
 was never present in genomes collected from Fiji. To correct for this geographic bias, we repeated the SEER analysis with country of origin included as a covariate in the regression. This analysis removed <italic>tetM</italic>
 as being significantly associated with invasiveness, highlighting the importance of such covariate considerations in performing association studies on large bacterial populations.</p>
<p>After applying this correction, we identified two significant hits (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 7</xref>
). The first corresponds to SNPs associating a specific allele of <italic>pepF</italic>
 (Oligoendopeptidase F; UniProt P54124) with invasive isolates. This could indicate a recombination event, due to the high SNP density and discordance with vertical evolution with respect to the inferred phylogeny<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b24">24</xref>
. The second hit represents SNPs in the intergenic region upstream of both IgG-binding protein H (<italic>sph</italic>
) and <italic>nrdI</italic>
 (ribonucleotide reductase). In support of these findings, previous work in murine models has found differential expression of <italic>sph</italic>
 during invasive disease<xref ref-type="bibr" rid="b25">25</xref>
<xref ref-type="bibr" rid="b26">26</xref>
<xref ref-type="bibr" rid="b27">27</xref>
, but little to no expression outside of this niche<xref ref-type="bibr" rid="b28">28</xref>
. If these k-mers were found to affect expression of the IgG-binding protein, this would be a plausible genetic mechanism affecting pathogenesis and invasive propensity<xref ref-type="bibr" rid="b29">29</xref>
. The association of both of these variations would have to be validated either <italic>in vitro</italic>
 or within a replication cohort, and functional follow-up such as RNA-seq may also aid in elucidating the role of these genetic variants in <italic>S. pyogenes</italic>
 pathogenesis.</p>
<p>In contrast, application of the existing association methods described above (plink and DSK) to this <italic>S. pyogenes</italic>
 population data set found no sites significantly associated with invasiveness. The Cochran–Mantel–Haenszel test (stratified by BAPS cluster) that uses SNPs called against a reference sequence failed to identify the <italic>tetM</italic>
 gene and transposon, as these elements are not found in the reference sequence. Furthermore, the population structure of this data set is so diverse that 88 different BAPS clusters were found, which overcorrects for population structure when using the DSK method, leaving too few samples within each group to provide the power to discover associations.</p>
</sec>
</sec>
<sec disp-level="1"><title>Discussion</title>
<p>SEER is a reference-independent, scalable pipeline capable of finding bacterial sequence elements associated with a range of phenotypes, while controlling for clonal population structure. The sequence elements can be interpreted in terms of protein function using sequence databases, and we have shown that even single causal variants can be fine-mapped using the SEER output.</p>
<p>Our use of all k-mers 9–100 bases long, together with robust regression methods and the ability to analyse very large sample sizes, shows improved sensitivity over existing methods. This provides a generic approach capable of analysing the rapidly increasing number of bacterial whole genome sequences linked with a range of different phenotypes. The output can readily be used in a meta-analysis of sequence elements to facilitate the combination of new studies with published data, both increasing discovery power and confirming the significance of results.</p>
<p>As with all association methods, our approach is limited by the amount of recombination and convergent evolution that occurs in the observed population, since the discovery of causal sequence elements is principally constrained by the extent of linkage disequilibrium. However, by introducing improved computational scalability and statistical sensitivity SEER significantly pushes the existing boundaries for answering important biologically and medically relevant questions.</p>
</sec>
<sec disp-level="1"><title>Methods</title>
<sec disp-level="2"><title>Counting informative k-mers in samples</title>
<p>We offer three different methods to count k-mers in all samples in a study. For very large studies, or for counting directly from reads rather than assemblies, we provide an implementation of distributed string mining (DSM)<xref ref-type="bibr" rid="b30">30</xref>
<xref ref-type="bibr" rid="b31">31</xref>
, which limits maximum memory usage per core, but requires a large cluster to run. For data sets up to around 5,000 sample assemblies we have implemented a single-core version, fsm-lite. For comparison with older data sets, or where resources do not allow the storage of the entire k-mer index in memory, DSK<xref ref-type="bibr" rid="b14">14</xref>
 is used to count a single k-mer length in each sample individually, the results of which are then combined.</p>
<p>Over all <italic>N</italic>
 samples, all k-mers over 9 bases long that occur in more than one sample are counted. All non-informative k-mers are omitted from the output; a k-mer <italic>Z</italic>
 is not informative if any one base extension to the left (<italic>aZ</italic>
) or right (<italic>Za</italic>
) has exactly the same frequency support vector as <italic>Z</italic>
. The frequency support vector has <italic>N</italic>
 entries, each being the number of occurrences of k-mer <italic>Z</italic>
 in each sample. Further filtering conditions are explained in the sections below.</p>
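<p>The definition of an informative k-mer can be illustrated with a brute-force sketch. It is not the string-mining algorithm used by DSM or fsm-lite, the helper names are hypothetical, and the toy sequences use words far shorter than the 9-base minimum purely to keep the example readable.</p>
<preformat>
```python
def support_vector(kmer, samples):
    """Number of (possibly overlapping) occurrences of kmer in each sample."""
    counts = []
    for seq in samples:
        n = 0
        start = seq.find(kmer)
        while start != -1:
            n += 1
            start = seq.find(kmer, start + 1)
        counts.append(n)
    return counts


def is_informative(kmer, samples, alphabet="ACGT"):
    """False if any one-base extension has the same support vector as kmer."""
    own = support_vector(kmer, samples)
    for base in alphabet:
        if support_vector(base + kmer, samples) == own:
            return False  # left extension aZ carries identical information
        if support_vector(kmer + base, samples) == own:
            return False  # right extension Za carries identical information
    return True


# Toy data: three short "samples" (hypothetical sequences).
samples = ["ACGTAC", "ACGTTT", "GGACGA"]
acg_informative = is_informative("ACG", samples)
cgt_informative = is_informative("CGT", samples)
```
</preformat>
<p>In this toy data set, CGT is always preceded by A, so ACGT has the same support vector and CGT is dropped as non-informative, whereas ACG is kept because none of its extensions occurs in exactly the same pattern.</p>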
<p>DSM<xref ref-type="bibr" rid="b30">30</xref>
<xref ref-type="bibr" rid="b31">31</xref>
 parallelizes to as much as one sample per core, and either 16 or 64 master server processes. DSM includes an optional entropy-filtering setting that filters the output k-mers based on both number of samples present and frequency distribution. On our 3,069 simulated genomes this took 2 h 38 min on 16 cores, and used 1 Gb RAM. The distributed approach is applicable up to terabytes of short-read data<xref ref-type="bibr" rid="b31">31</xref>
, but requires a cluster environment to run. As an easy-to-use alternative, we propose a single-core version of DSM that is applicable for gigabyte-scale data. We implemented the single-core version based on a succinct data structure library<xref ref-type="bibr" rid="b32">32</xref>
 to produce the same output as DSM. On 675 <italic>S. pyogenes</italic>
 genomes this took 3 h 44 min and used 22.3 Gb RAM.</p>
<p>To count single k-mer lengths, an associative array was used to combine the results from DSK in memory. We concatenated results from k-mer lengths of 21, 31 and 41, as in previous studies<xref ref-type="bibr" rid="b5">5</xref>
. This can scale to large genome numbers by instead using external sorting to avoid storing the entire array in memory.</p>
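<p>A minimal sketch of combining per-sample counts in an associative array is given below. It assumes the per-sample counts have already been parsed into in-memory dictionaries; the sample names, k-mers and function name are hypothetical, and no particular DSK output format is implied.</p>
<preformat>
```python
from collections import defaultdict


def combine_counts(per_sample_counts):
    """Merge per-sample {kmer: count} tables into {kmer: {sample: count}}."""
    combined = defaultdict(dict)
    for sample, counts in per_sample_counts.items():
        for kmer, count in counts.items():
            combined[kmer][sample] = count
    return dict(combined)


# Hypothetical 21-mer counts for two samples.
per_sample = {
    "sample_1": {"ACGTACGTACGTACGTACGTA": 2, "TTTTACGTACGTACGTACGTA": 1},
    "sample_2": {"ACGTACGTACGTACGTACGTA": 1},
}
table = combine_counts(per_sample)
# Keep only k-mers seen in more than one sample.
shared = sorted(k for k, by_sample in table.items() if len(by_sample) > 1)
```
</preformat>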
</sec>
<sec disp-level="2"><title>Filtering k-mers</title>
<p>Before testing for association we filter k-mers based on their frequency and unadjusted <italic>P</italic>
 value to reduce false positives from testing underpowered k-mers and reduce computational time.</p>
<p>K-mers are removed if they appear in <1% or >99% of samples, or are over 100 bases long. We also compute the <italic>P</italic>
 value of association in a simple <italic>χ</italic>
<sup>2</sup>
-test (1 d.f.) and remove the k-mer unless this value is <10<sup>−5</sup>
, since in simulations all true positives met this threshold. In the case of a continuous phenotype a Welch two-sample <italic>t</italic>
-test is used instead.</p>
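<p>The filtering rules for a binary phenotype can be sketched as follows. The thresholds are those stated above; the function names are hypothetical, and the use of a Pearson test without continuity correction is an assumption, since the text specifies only a simple test with 1 d.f.</p>
<preformat>
```python
import math


def chi2_p_value(presence, phenotype):
    """Pearson chi-squared test (1 d.f., no continuity correction), 2x2 table."""
    n = len(presence)
    a = sum(1 for x, y in zip(presence, phenotype) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(presence, phenotype) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(presence, phenotype) if x == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty margin means no association can be tested
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_filters(kmer, presence, phenotype,
                   min_freq=0.01, max_freq=0.99, max_len=100, p_cutoff=1e-5):
    """Apply the frequency, length and unadjusted P value filters in turn."""
    freq = sum(presence) / len(presence)
    if min_freq > freq or freq > max_freq:
        return False
    if len(kmer) > max_len:
        return False
    return p_cutoff > chi2_p_value(presence, phenotype)


# Hypothetical data: 50 cases then 50 controls; the k-mer is present in
# 45 cases and 5 controls.
phenotype = [1] * 50 + [0] * 50
presence = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45
keep = passes_filters("ACGTACGTAC", presence, phenotype)
```
</preformat>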
<p>The effect of this filtering step can be seen by plotting the unadjusted and adjusted <italic>P</italic>
 values of the k-mers from the simulated data set against each other (<xref ref-type="supplementary-material" rid="S1">Supplementary Figs 8 and 9</xref>
). Of the 12.7M k-mers passing frequency filtering, 430 have an unadjusted <italic>P</italic>
 value that does not reach the <italic>χ</italic>
<sup>2</sup>
 significance threshold, but would be significant using the adjusted test (and have a positive direction of effect). These k-mers are all short words (10–21 bases; median 12) that appear multiple times per sample, and therefore are of low specificity. Testing the top <italic>P</italic>
 value k-mer in this set showed a strong association of the presence/absence vector with three population structure covariates used (<italic>P</italic>
=1.35e−24; <italic>P</italic>
 =1.15e−46; <italic>P</italic>
 =1.53e−09, respectively). Using lasso regression, the first population structure covariate has a higher effect in the model than the k-mer frequency vector (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 10</xref>
). Altogether, this suggests that these filtered k-mers are associated with a lineage related to the phenotype, but are unlikely to be causal for the phenotype themselves. To confirm this, we mapped these k-mers back to the reference sequence. None of these k-mers map to the gene that is causal for the phenotype.</p>
</sec>
<sec disp-level="2"><title>Covariates to control for population structure</title>
<p>To correct for the clonal population structure of bacterial populations, a distance matrix is constructed from a random subsample of these k-mers, on which metric multi-dimensional scaling (MDS) is performed (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 11</xref>
). This is analogous to the standard method used in human genetics of using principal components of the SNP matrix to correct for divergent ancestry<xref ref-type="bibr" rid="b33">33</xref>
<xref ref-type="bibr" rid="b34">34</xref>
, but has the advantage that no core gene alignment or SNP calling is needed, so it can be applied directly to the k-mer counting result. Compared with modelling SNP variation, the use of k-mers as variable sequence elements has previously been shown to accurately estimate bacterial population structure<xref ref-type="bibr" rid="b35">35</xref>
.</p>
<p>A random sample of between 0.1% and 1% of k-mers appearing in between 5 and 95% of isolates is taken. We then construct a pairwise distance matrix <bold>D</bold>
, with each element being equal to a sum over all <italic>m</italic>
 sampled k-mers:</p>
<p><disp-formula id="eq1"><inline-graphic id="d33e903" xlink:href="ncomms12797-m1.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where <italic>k</italic>
<sub><italic>im</italic>
</sub>
 is 1 if the <italic>m</italic>
th sampled k-mer is present in sample <italic>i</italic>
, and 0 otherwise. Each element <italic>d</italic>
<sub><italic>ij</italic>
</sub>
 is therefore an estimate of the number of non-shared k-mers between a pair of samples <italic>i</italic>
 and <italic>j</italic>
. Clustering samples using these distances gives the same results as clustering core alignment SNPs using hierBAPS<xref ref-type="bibr" rid="b36">36</xref>
 (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 12</xref>
), which has been used in previous bacterial GWAS studies to correct for population structure.</p>
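<p>The distance calculation described above can be sketched as follows. This is a minimal illustration only: it assumes each element of the matrix is the sum of absolute differences between the presence/absence indicators of two samples (the form implied by the text), and the function name and toy data are hypothetical, not part of SEER.</p>
<preformat>
```python
def pairwise_kmer_distance(presence):
    """presence[i][m] is 1 if sampled k-mer m is present in sample i, else 0.

    Returns a symmetric matrix whose (i, j) entry counts the sampled
    k-mers present in exactly one of samples i and j (assumed form).
    """
    n = len(presence)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(abs(a - b) for a, b in zip(presence[i], presence[j]))
            dist[i][j] = dist[j][i] = d
    return dist


# Toy presence/absence vectors for three samples and five sampled k-mers.
presence = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
D = pairwise_kmer_distance(presence)
```
</preformat>
<p>In this toy example the first two samples differ at one k-mer, whereas the third differs from them at five and four k-mers, so it would fall in a separate cluster.</p>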
<p>Metric MDS is applied to <bold>D</bold>
, projecting these distances into a reduced number of dimensions. The normalized eigenvectors of each dimension are used as covariates in the regression model. The number of dimensions is a user-adjustable parameter, and can be chosen by evaluating the goodness-of-fit and the magnitude of the eigenvalues. In a species tree with two lineages and 96 isolates, one dimension was sufficient as a population control (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 13</xref>
), whereas for the larger collection of 3,069 isolates 10–15 dimensions were needed to give tight control (<xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 14</xref>
). Across all our studies, three dimensions generally appeared to be a good trade-off between sensitivity and specificity.</p>
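<p>As an illustration of the projection step, the sketch below performs classical (metric) MDS by double-centring the squared distances and extracting leading eigenvectors by power iteration. It is not the SEER implementation, which uses a linear algebra library; the function name is hypothetical, and the power iteration assumes that the eigenvalues of largest magnitude are positive.</p>
<preformat>
```python
import math


def classical_mds(dist, n_dims=1, n_iter=500):
    """Return [(eigenvalue, unit eigenvector), ...] for the leading dimensions."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (D is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    components = []
    for dim in range(n_dims):
        vec = [math.sin(i + 1.0 + dim) for i in range(n)]  # arbitrary start
        eigenvalue = 0.0
        for _ in range(n_iter):
            new = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in new))
            if norm == 0.0:
                break
            vec = [x / norm for x in new]
            eigenvalue = norm
        components.append((eigenvalue, vec))
        # Deflate so the next pass finds the following dimension.
        b = [[b[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return components


# Toy distance matrix for three samples (two close, one distant).
D = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]
eigenvalue, covariate = classical_mds(D, n_dims=1)[0]
```
</preformat>
<p>The entries of <italic>covariate</italic> would then be supplied as one column of the design matrix; the relative size of successive eigenvalues gives an indication of how many dimensions are worth keeping.</p>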
</sec>
<sec disp-level="2"><title>Logistic and linear regression</title>
<p>For each k-mer, a logistic curve is fitted to binary phenotype data, and a linear model to continuous data, using a time-efficient optimization routine to allow testing of all k-mers. Bacteria can be subject to extremely strong selection pressures, producing common variants with very large effect sizes, such as resistance-conferring variants selected by antibiotic use. This can make the data perfectly separable, in which case the maximum likelihood estimate for the logistic model does not exist. Firth regression<xref ref-type="bibr" rid="b37">37</xref>
 is used to obtain results in these cases.</p>
<p>For samples with binary outcome vector <italic><bold>y</bold>
</italic>
, for each k-mer a logistic model is fitted:</p>
<p><disp-formula id="eq2"><inline-graphic id="d33e962" xlink:href="ncomms12797-m2.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where absence and presence for each k-mer are coded as 0 and 1, respectively, in column 2 of the design matrix <bold>X</bold>
 (column 1 is a vector of ones, giving an intercept term). Subsequent columns <italic>j</italic>
 of <bold>X</bold>
 contain the eigenvectors of the MDS projection, user-supplied categorical covariates (dummy encoded), and quantitative covariates (normalized). The Broyden–Fletcher–Goldfarb–Shanno algorithm is used to maximize the log likelihood <italic>L</italic>
 with respect to the coefficient vector <italic><bold>β</bold>
</italic>
, using an analytic expression for the gradient d(log <italic>L</italic>
)/d<italic><bold>β</bold>
</italic>
:</p>
<p><disp-formula id="eq3"><inline-graphic id="d33e991" xlink:href="ncomms12797-m3.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where sig is the sigmoid function. If this fails to converge, <italic>n</italic>
 Newton–Raphson iterations are applied to <italic><bold>β</bold>
</italic>
:</p>
<p><disp-formula id="eq4"><inline-graphic id="d33e1003" xlink:href="ncomms12797-m4.jpg"></inline-graphic>
</disp-formula>
</p>
<p>from a starting point using the mean phenotype as the intercept, and the root-mean squared beta from a test of k-mers passing filtering</p>
<p><disp-formula id="eq5"><inline-graphic id="d33e1008" xlink:href="ncomms12797-m5.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq6"><inline-graphic id="d33e1011" xlink:href="ncomms12797-m6.jpg"></inline-graphic>
</disp-formula>
</p>
<p>which is slower, but has a higher success rate. If this fails to converge due to the observed points being separable, or the s.e. of the slope is >3 (which empirically indicated almost separable data, with no counts in one element of the contingency table), Firth logistic regression is then applied. This adds an adjustment to log <italic>L</italic>
:</p>
<p><disp-formula id="eq7"><inline-graphic id="d33e1020" xlink:href="ncomms12797-m7.jpg"></inline-graphic>
</disp-formula>
</p>
<p>using which Newton–Raphson iterations are applied as above.</p>
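<p>The Newton–Raphson step for the logistic model can be illustrated for the simplest design matrix, with an intercept and a single k-mer column. The sketch below is a simplification under stated assumptions: it omits the population structure covariates, starts the slope at zero, and does not implement the Firth adjustment, so it will fail on separable data. The function names and toy counts are hypothetical.</p>
<preformat>
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1 * x for 0/1 lists x (k-mer) and y (phenotype)."""
    b0 = sum(y) / len(y)   # start: mean phenotype as the intercept
    b1 = 0.0               # start: zero slope (an assumption of this sketch)
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Gradient of log L: X^T (y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix: X^T W X (2 x 2, inverted in closed form)
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # s.e. of the slope from the inverse information at the last iterate
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Toy data: k-mer present in 8 cases and 2 controls, absent in 3 cases and 7 controls.
x = [1] * 10 + [0] * 10
y = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7
b0, b1, se_b1 = fit_logistic_newton(x, y)
```
</preformat>
<p>For this two-by-two table the fitted slope equals the log odds ratio of the table, which gives a simple check on the iteration. A large s.e. on the slope would be the signal, described above, to switch to Firth regression.</p>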
<p>In the case of a continuous phenotype a linear model is fitted:</p>
<p><disp-formula id="eq8"><inline-graphic id="d33e1027" xlink:href="ncomms12797-m8.jpg"></inline-graphic>
</disp-formula>
</p>
<p>The squared distance U(<italic><bold>β</bold>
</italic>
)</p>
<p><disp-formula id="eq9"><inline-graphic id="d33e1036" xlink:href="ncomms12797-m9.jpg"></inline-graphic>
</disp-formula>
</p>
<p>is minimized using the Broyden–Fletcher–Goldfarb–Shanno algorithm. If this fails to converge then the analytic solution is obtained by orthogonal decomposition:</p>
<p><disp-formula id="eq10"><inline-graphic id="d33e1041" xlink:href="ncomms12797-m10.jpg"></inline-graphic>
</disp-formula>
</p>
<p>then back solving for <bold>β</bold>
 in:</p>
<p><disp-formula id="eq11"><inline-graphic id="d33e1049" xlink:href="ncomms12797-m11.jpg"></inline-graphic>
</disp-formula>
</p>
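<p>The orthogonal decomposition route can be sketched as below, using a QR factorization by modified Gram–Schmidt followed by back substitution. This is an illustrative reimplementation with hypothetical names, assuming a design matrix of full column rank; a production implementation would normally rely on a linear algebra library.</p>
<preformat>
```python
import math


def qr_least_squares(X, y):
    """Least-squares coefficients via X = QR, then back-solving R b = Q^T y."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    q_cols = []
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            # Remove the component along each earlier orthonormal column.
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / R[j][j] for a in v])
    qty = [sum(q[i] * y[i] for i in range(n)) for q in q_cols]  # Q^T y
    beta = [0.0] * p
    for j in reversed(range(p)):
        s = sum(R[j][k] * beta[k] for k in range(j + 1, p))
        beta[j] = (qty[j] - s) / R[j][j]
    return beta


# Toy data: intercept column plus k-mer presence, continuous phenotype.
X = [[1, 0], [1, 0], [1, 1], [1, 1]]
y = [1.0, 2.0, 3.0, 5.0]
beta = qr_least_squares(X, y)
```
</preformat>
<p>With a single binary column the intercept is the mean phenotype of samples lacking the k-mer and the slope is the difference in group means, here 1.5 and 2.5.</p>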
<p>In both cases the s.e. on <italic>β</italic>
<sub>1</sub>
 is calculated by inverting the Fisher information matrix d<sup>2</sup>
<italic>L</italic>
/d<italic><bold>β</bold>
</italic>
<sup>2</sup>
 (inversions are performed by Cholesky decomposition, or if this fails due to the matrix being almost singular the Moore–Penrose pseudoinverse is taken) to obtain the variance-covariance matrix. The Wald statistic is calculated with the null hypothesis of no association (<italic>β</italic>
<sub>1</sub>
=0):</p>
<p><disp-formula id="eq12"><inline-graphic id="d33e1077" xlink:href="ncomms12797-m12.jpg"></inline-graphic>
</disp-formula>
</p>
<p>which, under the null hypothesis, follows a <italic>χ</italic>
<sup>2</sup>
 distribution with 1 d.f. This is equivalent to comparing the square root of the statistic with the tails of a standard normal distribution, the integral of which gives the <italic>P</italic>
 value.</p>
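<p>A short sketch of this calculation follows, assuming the Wald statistic takes its usual form of the squared ratio of the estimate to its s.e. The upper tail of a <italic>χ</italic><sup>2</sup> distribution with 1 d.f. can be written with the complementary error function, so no statistics library is needed; the function name is hypothetical.</p>
<preformat>
```python
import math


def wald_p_value(beta1, se_beta1):
    """Return (W, P) for the null hypothesis beta1 = 0."""
    w = (beta1 / se_beta1) ** 2
    # Upper tail of chi-squared with 1 d.f., equal to the two-sided
    # tail of a standard normal evaluated at sqrt(W).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p


w_stat, p_value = wald_p_value(1.96, 1.0)
```
</preformat>
<p>An estimate 1.96 s.e. from zero gives a <italic>P</italic> value very close to 0.05, as expected.</p>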
</sec>
<sec disp-level="2"><title>Significance cutoff</title>
<p>For the basal cutoff for significance we use <italic>P</italic>
<0.05, which in our testing we conservatively Bonferroni corrected to the threshold 1 × 10<sup>−8</sup>
 based on every position in the <italic>S. pneumoniae</italic>
 genome having three possible mutations<xref ref-type="bibr" rid="b38">38</xref>
, and all this variation being uncorrelated. This strict cutoff prevents a large number of false positives arising from the very large number of k-mers tested, but does not over-penalize by correcting directly for the number of k-mers counted. To calculate an empirical significance cutoff for the <italic>P</italic>
 value under multiple correlated tests, we observed the distribution of <italic>P</italic>
 values from 100 random permutations of the phenotype. For the 3,069 Thai genomes, setting the family-wise error rate at 0.05 gave a cutoff of 1.4 × 10<sup>−8</sup>
, supporting the above reasoning.</p>
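<p>Both ways of choosing the cutoff can be written in a few lines. In the sketch below the genome length of 2.2 Mb is an assumed round figure for illustration, the permutation cutoff is taken as the 0.05 quantile of the smallest <italic>P</italic> value seen in each permutation (one common definition, not necessarily the exact procedure used here), and the input values are synthetic placeholders.</p>
<preformat>
```python
def bonferroni_cutoff(alpha, genome_length, mutations_per_site=3):
    """Basal alpha divided by the number of possible single-site variants."""
    return alpha / (genome_length * mutations_per_site)


def permutation_cutoff(min_p_values, alpha=0.05):
    """Empirical cutoff from the minimum P value of each phenotype permutation."""
    ordered = sorted(min_p_values)
    index = max(int(round(alpha * len(ordered))) - 1, 0)
    return ordered[index]


# Assumed genome length of 2.2 Mb, three possible mutations per position.
theoretical = bonferroni_cutoff(0.05, 2200000)

# Synthetic minimum P values standing in for 100 permutations.
synthetic_minima = [i * 1e-9 for i in range(1, 101)]
empirical = permutation_cutoff(synthetic_minima)
```
</preformat>
<p>Under these assumptions the theoretical value is about 7.6 × 10<sup>−9</sup>, the same order of magnitude as the threshold quoted above.</p>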
<p>In general, the number of k-mers and the correlations between their frequency vectors will vary depending on the species and specific samples in the study, so the <italic>P</italic>
 value cutoff should be chosen in this manner (either by considering possible variation given the genome length, or by permutation testing) for each individual study. Association effect size and <italic>P</italic>
 value of the MDS components are also included in the output, to compare lineage and variant effects on the phenotype variation.</p>
</sec>
<sec disp-level="2"><title>SEER implementation</title>
<p>SEER is implemented in C++ using the armadillo linear algebra library<xref ref-type="bibr" rid="b39">39</xref>
, and dlib optimization library<xref ref-type="bibr" rid="b40">40</xref>
. On a simulation of 3,069 diverse 0.4 Mb genomes, 143M k-mers were counted by DSM and 25M 31-mers by DSK. On the largest DSM set, using 16 cores and subsampling 0.3M k-mers (0.2% of the total), calculating population covariates took 6 h 42 min and 8.33 GB RAM. This step is <italic>O</italic>
(<italic>N</italic>
<sup>2</sup>
<italic>M</italic>
) where <italic>N</italic>
 is the number of samples and <italic>M</italic>
 is the number of k-mers, but can be parallelized across up to <italic>N</italic>
<sup>2</sup>
 cores.</p>
<p>Processing all 143M informative k-mers as described took 69 min 44 s and 23 MB RAM on 16 cores. This step is <italic>O</italic>
(<italic>M</italic>
) and can be parallelized across up to <italic>M</italic>
 cores.</p>
<p>On the real data set of full-length genomes, the 68M informative k-mers counted were fewer than in the simulated data set above, as the parameters of the simulation created particularly diverse final genomes (<xref ref-type="supplementary-material" rid="S1">Supplementary methods</xref>
).</p>
</sec>
<sec disp-level="2"><title>Interpreting significant k-mers</title>
<p>K-mers reaching the threshold for significance are then post-association filtered requiring <italic>β</italic>
<sub>1</sub>
>0 as a negative effect size does not make biological sense. Remaining k-mers are searched for by exact match in their <italic>de novo</italic>
 assemblies, and annotations of features examined for overlap of function. BLAT<xref ref-type="bibr" rid="b41">41</xref>
 is also used with a step size of 2 and minimum match size of 15 to find inexact but close matches to a well-annotated reference sequence.</p>
<p>To better search for gene clusters associated with phenotype, these k-mers are assembled using Velvet<xref ref-type="bibr" rid="b9">9</xref>
 choosing a smaller sub-k-mer size that maximizes the longest contig length of the final assembly. K-mers that are substrings of other significant k-mers are then removed.</p>
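<p>The substring filter can be sketched as follows. This is an illustration with a hypothetical function name; it compares sequences on one strand only and does not consider reverse complements.</p>
<preformat>
```python
def remove_substring_kmers(kmers):
    """Keep only k-mers that are not contained in a longer retained k-mer."""
    unique = sorted(set(kmers), key=len, reverse=True)
    kept = []
    for kmer in unique:
        if not any(kmer in longer for longer in kept):
            kept.append(kmer)
    return kept


significant = ["ACGTACGT", "CGTA", "TTTT", "ACGT"]
filtered = remove_substring_kmers(significant)
```
</preformat>
<p>Processing the longest sequences first means each k-mer only needs to be checked against those already retained.</p>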
<p>Small k-mers are more likely than full reads to map equally well to multiple places in the reference genome, so reporting both mappings increases the sensitivity. For this data set an average of 21% of k-mers significantly associated with antibiotic resistance report secondary mappings. These k-mers are short (median 15 bp), and therefore have low specificity and high sensitivity as expected.</p>
</sec>
<sec disp-level="2"><title>Mapping of a single SNP</title>
<p>Using the BLAT mapping of significant k-mers to a reference sequence, SNPs are called using bcftools<xref ref-type="bibr" rid="b42">42</xref>
. Quality scores for a read are set to be identical, and are set as the Phred-scaled Holm-adjusted <italic>P</italic>
 values from association. High-quality (QUAL>100) SNPs are then annotated for function using SnpEff<xref ref-type="bibr" rid="b43">43</xref>
, and the effect of missense SNPs on protein function is ranked using SIFT<xref ref-type="bibr" rid="b19">19</xref>
.</p>
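<p>The quality values described above can be illustrated with a standard Holm step-down adjustment followed by Phred scaling. The function names are hypothetical and the <italic>P</italic> values are toy numbers.</p>
<preformat>
```python
import math


def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, value)   # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def phred_scale(p):
    """Phred-scaled quality: -10 * log10(P)."""
    return -10.0 * math.log10(p)


adjusted = holm_adjust([0.01, 0.04, 0.03])
quality = phred_scale(1e-10)
```
</preformat>
<p>On this scale a quality of 100 corresponds to an adjusted <italic>P</italic> value of 10<sup>−10</sup>, so the high-quality filter retains only SNPs supported by k-mers with adjusted <italic>P</italic> values below that level.</p>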
</sec>
<sec disp-level="2"><title>Comparison with existing methods</title>
<p>We compare with two existing methods. The first uses a core-genome SNP mapping along with population clusters defined from the same alignment to perform a Cochran–Mantel–Haenszel test at every called variant site<xref ref-type="bibr" rid="b6">6</xref>
. The second uses a fixed k-mer length of 31 as counted by DSK<xref ref-type="bibr" rid="b14">14</xref>
, with a Monte Carlo phylogeny-based population control<xref ref-type="bibr" rid="b5">5</xref>
. As the second method is not scalable to this population size we used our population control as calculated from all genomes in the population, and a subsample of 100 samples to calculate association statistics, which is roughly the number computationally accessible by this method. In both cases, the same Bonferroni correction is used as for SEER.</p>
</sec>
<sec disp-level="2"><title>Simulating bacterial populations</title>
<p>A random subset of 450 genes from the <italic>Streptococcus pneumoniae</italic>
 ATCC 700669 (ref. <xref ref-type="bibr" rid="b16">16</xref>
) strain was used as the starting genome for the Artificial Life Framework (ALF)<xref ref-type="bibr" rid="b44">44</xref>
. ALF simulated 3,069 final genomes along the phylogeny observed in a Thai refugee camp<xref ref-type="bibr" rid="b13">13</xref>
. An alignment between <italic>S. pneumoniae</italic>
 strains R6, 19F and <italic>Streptococcus mitis</italic>
 B6 using Progressive Cactus was used to estimate rates in the GTR matrix and the size distribution of insertions and deletions (INDELs; <xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 3</xref>
). Previous estimates for the relative rate of SNPs to INDELs<xref ref-type="bibr" rid="b45">45</xref>
 and the rate of horizontal gene transfer and loss<xref ref-type="bibr" rid="b13">13</xref>
 were used.</p>
<p>pIRS<xref ref-type="bibr" rid="b46">46</xref>
 was used to simulate error-prone reads from genomes at the tips of the tree, which were then assembled by Velvet<xref ref-type="bibr" rid="b9">9</xref>
. DSM was used to count k-mers from these <italic>de novo</italic>
 assemblies.</p>
<p>To test the similarity of the population control to existing methods, 96 full <italic>S. pneumoniae</italic>
 ATCC 700669 genomes were evolved with ALF. Intergenic regions were also evolved using Dawg<xref ref-type="bibr" rid="b47">47</xref>
 at a previously determined rate<xref ref-type="bibr" rid="b48">48</xref>
. These were combined, and assemblies generated and k-mers counted as above. A distance matrix was created from 1% of the k-mers as described above, and a neighbour-joining tree produced from this.</p>
<p>The resulting tree was scored against the true tree by counting one for each pair of isolates within a BAPS cluster whose most recent common ancestor (MRCA) had a descendant isolate outside that BAPS cluster.</p>
</sec>
<sec disp-level="2"><title>Simulating phenotype based on genotype and odds ratio</title>
<p>Ratio of cases to controls in the population (<italic>S</italic>
<sub><italic>R</italic>
</sub>
) was set at 50% to represent antibiotic resistance, and a single variant (gene presence/absence or a SNP) was designated as causal. The minor allele frequency (MAF) in the population is set from the simulation, and the odds ratio (OR) can be varied. The number of cases <italic>D</italic>
<sub><italic>E</italic>
</sub>
 is then the solution to a quadratic equation<xref ref-type="bibr" rid="b49">49</xref>
, which is related to probability of a sample being a case by:</p>
<p><disp-formula id="eq13"><inline-graphic id="d33e1286" xlink:href="ncomms12797-m13.jpg"></inline-graphic>
</disp-formula>
</p>
<p><disp-formula id="eq14"><inline-graphic id="d33e1289" xlink:href="ncomms12797-m14.jpg"></inline-graphic>
</disp-formula>
</p>
<p>The population was then randomly subsampled 100 times, with case and control status assigned for each run using these formulae. Power was defined by the proportion of runs that had at least one k-mer in the gene significantly associated with the phenotype.</p>
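<p>One way to obtain case probabilities consistent with a chosen OR is sketched below. It is a reconstruction under assumptions, not the published formulae: it treats the population as a two-by-two table with fixed margins (case fraction and variant frequency), solves the resulting quadratic for the number of cases carrying the variant, and converts this to the probability of being a case with and without the variant. All names are hypothetical.</p>
<preformat>
```python
import math


def exposed_cases(n, case_ratio, maf, odds_ratio):
    """Number of cases carrying the variant in a 2x2 table with fixed margins."""
    cases = case_ratio * n      # total cases
    exposed = maf * n           # samples carrying the causal variant
    if odds_ratio == 1.0:
        return cases * exposed / n
    a = 1.0 - odds_ratio
    b = n - cases - exposed + odds_ratio * (cases + exposed)
    c = -odds_ratio * cases * exposed
    root = math.sqrt(b * b - 4.0 * a * c)
    candidates = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    low, high = max(0.0, cases + exposed - n), min(cases, exposed)
    # Keep the root that lies inside the feasible range of the table.
    return min(candidates, key=lambda r: abs(r - min(max(r, low), high)))


def case_probabilities(n, case_ratio, maf, odds_ratio):
    """Return (P(case | variant present), P(case | variant absent))."""
    d_e = exposed_cases(n, case_ratio, maf, odds_ratio)
    p_present = d_e / (maf * n)
    p_absent = (case_ratio * n - d_e) / (n - maf * n)
    return p_present, p_absent


p_present, p_absent = case_probabilities(1000, 0.5, 0.2, 4.0)
```
</preformat>
<p>Case or control status for each subsampled isolate could then be drawn as a Bernoulli variable with the probability matching its genotype, and power estimated as the fraction of runs in which a k-mer in the causal gene passes the significance cutoff.</p>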
</sec>
<sec disp-level="2"><title>Code availability</title>
<p>SEER is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/johnlees/seer">https://github.com/johnlees/seer</ext-link>
, DSM at <ext-link ext-link-type="uri" xlink:href="https://github.com/HIITMetagenomics/dsm-framework">https://github.com/HIITMetagenomics/dsm-framework</ext-link> and fsm-lite
 at <ext-link ext-link-type="uri" xlink:href="https://github.com/nvalimak/fsm-lite">https://github.com/nvalimak/fsm-lite</ext-link>
. Scripts used to perform the simulations are available at <ext-link ext-link-type="uri" xlink:href="https://github.com/johnlees/bioinformatics">https://github.com/johnlees/bioinformatics</ext-link>
.</p>
</sec>
<sec disp-level="2"><title>Data availability</title>
<p><italic>S. pyogenes</italic>
 sequence reads are available on the European Nucleotide Archive under study accession IDs <ext-link ext-link-type="EBI:ena" xlink:href="PRJEB2839">PRJEB2839</ext-link>
 (isolates from Fiji) and PRJEB3313 (isolates from Kilifi). Results from the <italic>S. pyogenes</italic>
 invasiveness GWAS can be found at: <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1613851">http://dx.doi.org/10.6084/m9.figshare.1613851</ext-link>
 and can be loaded directly into Phandango (<ext-link ext-link-type="uri" xlink:href="http://jameshadfield.github.io/phandango/">http://jameshadfield.github.io/phandango/</ext-link>
) to view the results.</p>
</sec>
</sec>
<sec disp-level="1"><title>Additional information</title>
<p><bold>How to cite this article:</bold>
 Lees, J. A. <italic>et al.</italic>
 Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. <italic>Nat. Commun.</italic>
 7:12797 doi: 10.1038/ncomms12797 (2016).</p>
</sec>
<sec sec-type="supplementary-material" id="S1"><title>Supplementary Material</title>
<supplementary-material id="d33e18" content-type="local-data"><caption><title>Supplementary Information</title>
<p>Supplementary Figures 1-14, Supplementary Table 1 and Supplementary Methods.</p>
</caption>
<media xlink:href="ncomms12797-s1.pdf"></media>
</supplementary-material>
<supplementary-material id="d33e24" content-type="local-data"><caption><title>Peer Review File</title>
</caption>
<media xlink:href="ncomms12797-s2.pdf"></media>
</supplementary-material>
</sec>
</body>
<back><ack><p>We thank James Hadfield for his help in integrating SEER's output into the bacterial genome visualization tool Phandango, and Jeff Barrett and his group for helpful discussions on the relation of association studies in human genetics to prokaryotic genetics. This work was supported by Wellcome Trust grants 098051 and 107376/Z/15/Z, MRC grant 1365620, ERC grant 239784, Academy of Finland grant 287665 and COIN Centre of Excellence.</p>
</ack>
<ref-list><ref id="b1"><mixed-citation publication-type="journal"><name><surname>Falush</surname>
<given-names>D.</given-names>
</name>
<article-title>Bacterial genomics: Microbial GWAS coming of age</article-title>
. <source>Nat. Microbiol.</source>
<volume>1</volume>
, <fpage>16059</fpage>
 (<year>2016</year>
).<pub-id pub-id-type="pmid">27572652</pub-id>
</mixed-citation>
</ref>
<ref id="b2"><mixed-citation publication-type="journal"><name><surname>Chen</surname>
<given-names>P. E.</given-names>
</name>
 & <name><surname>Shapiro</surname>
<given-names>B. J.</given-names>
</name>
<article-title>The advent of genome-wide association studies for bacteria</article-title>
. <source>Curr. Opin. Microbiol.</source>
<volume>25</volume>
, <fpage>17</fpage>
–<lpage>24</lpage>
 (<year>2015</year>
).<pub-id pub-id-type="pmid">25835153</pub-id>
</mixed-citation>
</ref>
<ref id="b3"><mixed-citation publication-type="journal"><name><surname>Farhat</surname>
<given-names>M. R.</given-names>
</name>
<italic>et al.</italic>
<article-title>Genomic analysis identifies targets of convergent positive selection in drug-resistant Mycobacterium tuberculosis</article-title>
. <source>Nat. Genet.</source>
<volume>45</volume>
, <fpage>1183</fpage>
–<lpage>1189</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23995135</pub-id>
</mixed-citation>
</ref>
<ref id="b4"><mixed-citation publication-type="journal"><name><surname>Liu</surname>
<given-names>J. Z.</given-names>
</name>
 & <name><surname>Anderson</surname>
<given-names>C. A.</given-names>
</name>
<article-title>Genetic studies of Crohn's disease: past, present and future</article-title>
. <source>Best Pract. Res. Clin. Gastroenterol.</source>
<volume>28</volume>
, <fpage>373</fpage>
–<lpage>386</lpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">24913378</pub-id>
</mixed-citation>
</ref>
<ref id="b5"><mixed-citation publication-type="journal"><name><surname>Sheppard</surname>
<given-names>S. K.</given-names>
</name>
<italic>et al.</italic>
<article-title>Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter</article-title>
. <source>Proc. Natl Acad. Sci. USA</source>
<volume>110</volume>
, <fpage>11923</fpage>
–<lpage>11927</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23818615</pub-id>
</mixed-citation>
</ref>
<ref id="b6"><mixed-citation publication-type="journal"><name><surname>Chewapreecha</surname>
<given-names>C.</given-names>
</name>
<italic>et al.</italic>
<article-title>Comprehensive identification of single nucleotide polymorphisms associated with beta-lactam resistance within pneumococcal mosaic genes</article-title>
. <source>PLoS Genet.</source>
<volume>10</volume>
, <fpage>e1004547</fpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">25101644</pub-id>
</mixed-citation>
</ref>
<ref id="b7"><mixed-citation publication-type="journal"><name><surname>Laabei</surname>
<given-names>M.</given-names>
</name>
<italic>et al.</italic>
<article-title>Predicting the virulence of MRSA from its genome sequence</article-title>
. <source>Genome Res.</source>
<volume>24</volume>
, <fpage>839</fpage>
–<lpage>849</lpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">24717264</pub-id>
</mixed-citation>
</ref>
<ref id="b8"><mixed-citation publication-type="journal"><name><surname>Weinert</surname>
<given-names>L. A.</given-names>
</name>
<italic>et al.</italic>
<article-title>Genomic signatures of human and animal disease in the zoonotic pathogen Streptococcus suis</article-title>
. <source>Nat. Commun.</source>
<volume>6</volume>
, <fpage>6740</fpage>
 (<year>2015</year>
).<pub-id pub-id-type="pmid">25824154</pub-id>
</mixed-citation>
</ref>
<ref id="b9"><mixed-citation publication-type="journal"><name><surname>Zerbino</surname>
<given-names>D. R.</given-names>
</name>
 & <name><surname>Birney</surname>
<given-names>E.</given-names>
</name>
<article-title>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</article-title>
. <source>Genome Res.</source>
<volume>18</volume>
, <fpage>821</fpage>
–<lpage>829</lpage>
 (<year>2008</year>
).<pub-id pub-id-type="pmid">18349386</pub-id>
</mixed-citation>
</ref>
<ref id="b10"><mixed-citation publication-type="journal"><name><surname>Gardner</surname>
<given-names>S. N.</given-names>
</name>
 & <name><surname>Hall</surname>
<given-names>B. G.</given-names>
</name>
<article-title>When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes</article-title>
. <source>PLoS ONE</source>
<volume>8</volume>
, <fpage>e81760</fpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">24349125</pub-id>
</mixed-citation>
</ref>
<ref id="b11"><mixed-citation publication-type="journal"><name><surname>Ondov</surname>
<given-names>B. D.</given-names>
</name>
<italic>et al.</italic>
<article-title>Mash: fast genome and metagenome distance estimation using MinHash</article-title>
. <source>Genome Biol.</source>
<volume>17</volume>
, <fpage>1</fpage>
–<lpage>14</lpage>
 (<year>2016</year>
).<pub-id pub-id-type="pmid">26753840</pub-id>
</mixed-citation>
</ref>
<ref id="b12"><mixed-citation publication-type="journal"><name><surname>Evangelou</surname>
<given-names>E.</given-names>
</name>
 & <name><surname>Ioannidis</surname>
<given-names>J. P. A.</given-names>
</name>
<article-title>Meta-analysis methods for genome-wide association studies and beyond</article-title>
. <source>Nat. Rev. Genet.</source>
<volume>14</volume>
, <fpage>379</fpage>
–<lpage>389</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23657481</pub-id>
</mixed-citation>
</ref>
<ref id="b13"><mixed-citation publication-type="journal"><name><surname>Chewapreecha</surname>
<given-names>C.</given-names>
</name>
<italic>et al.</italic>
<article-title>Dense genomic sampling identifies highways of pneumococcal recombination</article-title>
. <source>Nat. Genet.</source>
<volume>46</volume>
, <fpage>305</fpage>
–<lpage>309</lpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">24509479</pub-id>
</mixed-citation>
</ref>
<ref id="b14"><mixed-citation publication-type="journal"><name><surname>Rizk</surname>
<given-names>G.</given-names>
</name>
, <name><surname>Lavenier</surname>
<given-names>D.</given-names>
</name>
 & <name><surname>Chikhi</surname>
<given-names>R.</given-names>
</name>
<article-title>DSK: K-mer counting with very low memory usage</article-title>
. <source>Bioinformatics</source>
<volume>29</volume>
, <fpage>652</fpage>
–<lpage>653</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23325618</pub-id>
</mixed-citation>
</ref>
<ref id="b15"><mixed-citation publication-type="journal"><name><surname>Spain</surname>
<given-names>S. L.</given-names>
</name>
 & <name><surname>Barrett</surname>
<given-names>J. C.</given-names>
</name>
<article-title>Strategies for fine-mapping complex traits</article-title>
. <source>Hum. Mol. Genet.</source>
<volume>24</volume>
, <fpage>R111</fpage>
–<lpage>R119</lpage>
 (<year>2015</year>
).<pub-id pub-id-type="pmid">26157023</pub-id>
</mixed-citation>
</ref>
<ref id="b16"><mixed-citation publication-type="journal"><name><surname>Croucher</surname>
<given-names>N. J.</given-names>
</name>
<italic>et al.</italic>
<article-title>Role of conjugative elements in the evolution of the multidrug-resistant pandemic clone Streptococcus pneumoniae Spain23F ST81</article-title>
. <source>J. Bacteriol.</source>
<volume>191</volume>
, <fpage>1480</fpage>
–<lpage>1489</lpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19114491</pub-id>
</mixed-citation>
</ref>
<ref id="b17"><mixed-citation publication-type="journal"><name><surname>Croucher</surname>
<given-names>N. J.</given-names>
</name>
<italic>et al.</italic>
<article-title>Rapid pneumococcal evolution in response to clinical interventions</article-title>
. <source>Science</source>
<volume>331</volume>
, <fpage>430</fpage>
–<lpage>434</lpage>
 (<year>2011</year>
).<pub-id pub-id-type="pmid">21273480</pub-id>
</mixed-citation>
</ref>
<ref id="b18"><mixed-citation publication-type="journal"><name><surname>Maskell</surname>
<given-names>J. P.</given-names>
</name>
, <name><surname>Sefton</surname>
<given-names>A. M.</given-names>
</name>
 & <name><surname>Hall</surname>
<given-names>L. M.</given-names>
</name>
<article-title>Multiple mutations modulate the function of dihydrofolate reductase in trimethoprim-resistant Streptococcus pneumoniae</article-title>
. <source>Antimicrob. Agents Chemother.</source>
<volume>45</volume>
, <fpage>1104</fpage>
–<lpage>1108</lpage>
 (<year>2001</year>
).<pub-id pub-id-type="pmid">11257022</pub-id>
</mixed-citation>
</ref>
<ref id="b19"><mixed-citation publication-type="journal"><name><surname>Ng</surname>
<given-names>P. C.</given-names>
</name>
 & <name><surname>Henikoff</surname>
<given-names>S.</given-names>
</name>
<article-title>SIFT: predicting amino acid changes that affect protein function</article-title>
. <source>Nucleic Acids Res.</source>
<volume>31</volume>
, <fpage>3812</fpage>
–<lpage>3814</lpage>
 (<year>2003</year>
).<pub-id pub-id-type="pmid">12824425</pub-id>
</mixed-citation>
</ref>
<ref id="b20"><mixed-citation publication-type="journal"><name><surname>Steer</surname>
<given-names>A. C.</given-names>
</name>
<italic>et al.</italic>
<article-title>emm and C-repeat region molecular typing of beta-hemolytic streptococci in a tropical country: Implications for vaccine development</article-title>
. <source>J. Clin. Microbiol.</source>
<volume>47</volume>
, <fpage>2502</fpage>
–<lpage>2509</lpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19515838</pub-id>
</mixed-citation>
</ref>
<ref id="b21"><mixed-citation publication-type="journal"><name><surname>Seale</surname>
<given-names>A. C.</given-names>
</name>
<italic>et al.</italic>
<article-title>Invasive Group A Streptococcus Infection among Children, Rural Kenya</article-title>
. <source>Emerg. Infect. Dis. J.</source>
<volume>22</volume>
, <fpage>224</fpage>
 (<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b22"><mixed-citation publication-type="journal"><name><surname>Roberts</surname>
<given-names>A. P.</given-names>
</name>
 & <name><surname>Mullany</surname>
<given-names>P.</given-names>
</name>
<article-title>A modular master on the move: the Tn916 family of mobile genetic elements</article-title>
. <source>Trends Microbiol.</source>
<volume>17</volume>
, <fpage>251</fpage>
–<lpage>258</lpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19464182</pub-id>
</mixed-citation>
</ref>
<ref id="b23"><mixed-citation publication-type="journal"><name><surname>Dubnau</surname>
<given-names>D.</given-names>
</name>
<article-title>DNA Uptake in Bacteria</article-title>
. <source>Annu. Rev. Microbiol.</source>
<volume>53</volume>
, <fpage>217</fpage>
–<lpage>244</lpage>
 (<year>1999</year>
).<pub-id pub-id-type="pmid">10547691</pub-id>
</mixed-citation>
</ref>
<ref id="b24"><mixed-citation publication-type="journal"><name><surname>Lefébure</surname>
<given-names>T.</given-names>
</name>
 & <name><surname>Stanhope</surname>
<given-names>M. J.</given-names>
</name>
<article-title>Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition</article-title>
. <source>Genome Biol.</source>
<volume>8</volume>
, <fpage>R71</fpage>
 (<year>2007</year>
).<pub-id pub-id-type="pmid">17475002</pub-id>
</mixed-citation>
</ref>
<ref id="b25"><mixed-citation publication-type="journal"><name><surname>Raeder</surname>
<given-names>R.</given-names>
</name>
 & <name><surname>Boyle</surname>
<given-names>M. D.</given-names>
</name>
<article-title>Association between expression of immunoglobulin G-binding proteins by group A streptococci and virulence in a mouse skin infection model</article-title>
. <source>Infect. Immun.</source>
<volume>61</volume>
, <fpage>1378</fpage>
–<lpage>1384</lpage>
 (<year>1993</year>
).<pub-id pub-id-type="pmid">8454339</pub-id>
</mixed-citation>
</ref>
<ref id="b26"><mixed-citation publication-type="journal"><name><surname>Raeder</surname>
<given-names>R.</given-names>
</name>
 & <name><surname>Boyle</surname>
<given-names>M. D.</given-names>
</name>
<article-title>Analysis of immunoglobulin G-binding-protein expression by invasive isolates of Streptococcus pyogenes</article-title>
. <source>Clin. Diagn. Lab. Immunol.</source>
<volume>2</volume>
, <fpage>484</fpage>
–<lpage>486</lpage>
 (<year>1995</year>
).<pub-id pub-id-type="pmid">7583929</pub-id>
</mixed-citation>
</ref>
<ref id="b27"><mixed-citation publication-type="journal"><name><surname>Smith</surname>
<given-names>T. C.</given-names>
</name>
, <name><surname>Sledjeski</surname>
<given-names>D. D.</given-names>
</name>
 & <name><surname>Boyle</surname>
<given-names>M. D. P.</given-names>
</name>
<article-title>Streptococcus pyogenes Infection in Mouse Skin Leads to a Time-Dependent Up-Regulation of Protein H Expression</article-title>
. <source>Infect. Immun.</source>
<volume>71</volume>
, <fpage>6079</fpage>
–<lpage>6082</lpage>
 (<year>2003</year>
).<pub-id pub-id-type="pmid">14500534</pub-id>
</mixed-citation>
</ref>
<ref id="b28"><mixed-citation publication-type="journal"><name><surname>Smith</surname>
<given-names>T. C.</given-names>
</name>
, <name><surname>Sledjeski</surname>
<given-names>D. D.</given-names>
</name>
 & <name><surname>Boyle</surname>
<given-names>M. D. P.</given-names>
</name>
<article-title>Regulation of protein H expression in M1 serotype isolates of Streptococcus pyogenes</article-title>
. <source>FEMS Microbiol. Lett.</source>
<volume>219</volume>
, <fpage>9</fpage>
–<lpage>15</lpage>
 (<year>2003</year>
).<pub-id pub-id-type="pmid">12594016</pub-id>
</mixed-citation>
</ref>
<ref id="b29"><mixed-citation publication-type="journal"><name><surname>Walker</surname>
<given-names>M. J.</given-names>
</name>
<italic>et al.</italic>
<article-title>Disease manifestations and pathogenic mechanisms of group A Streptococcus</article-title>
. <source>Clin. Microbiol. Rev.</source>
<volume>27</volume>
, <fpage>264</fpage>
–<lpage>301</lpage>
 (<year>2014</year>
).<pub-id pub-id-type="pmid">24696436</pub-id>
</mixed-citation>
</ref>
<ref id="b30"><mixed-citation publication-type="journal"><name><surname>Välimäki</surname>
<given-names>N.</given-names>
</name>
 & <name><surname>Puglisi</surname>
<given-names>S.</given-names>
</name>
 in <source>Algorithms in Bioinformatics SE - 35</source>
 Vol. 7534 (eds Raphael B., Tang J.) <fpage>441</fpage>
–<lpage>452</lpage>
Springer (<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b31"><mixed-citation publication-type="journal"><name><surname>Seth</surname>
<given-names>S.</given-names>
</name>
, <name><surname>Välimäki</surname>
<given-names>N.</given-names>
</name>
, <name><surname>Kaski</surname>
<given-names>S.</given-names>
</name>
 & <name><surname>Honkela</surname>
<given-names>A.</given-names>
</name>
<article-title>Exploration and retrieval of whole-metagenome sequencing samples</article-title>
. <source>Bioinformatics</source>
<volume>30</volume>
, <fpage>16</fpage>
 (<year>2014</year>
).</mixed-citation>
</ref>
<ref id="b32"><mixed-citation publication-type="journal"><name><surname>Gog</surname>
<given-names>S.</given-names>
</name>
, <name><surname>Beller</surname>
<given-names>T.</given-names>
</name>
, <name><surname>Moffat</surname>
<given-names>A.</given-names>
</name>
 & <name><surname>Petri</surname>
<given-names>M.</given-names>
</name>
 in <source>Experimental Algorithms SE - 28</source>
 (eds Gudmundsson J., Katajainen J.) <fpage>326</fpage>
–<lpage>337</lpage>
Springer International Publishing (<year>2014</year>
).</mixed-citation>
</ref>
<ref id="b33"><mixed-citation publication-type="journal"><name><surname>Price</surname>
<given-names>A. L.</given-names>
</name>
<italic>et al.</italic>
<article-title>Principal components analysis corrects for stratification in genome-wide association studies</article-title>
. <source>Nat. Genet.</source>
<volume>38</volume>
, <fpage>904</fpage>
–<lpage>909</lpage>
 (<year>2006</year>
).<pub-id pub-id-type="pmid">16862161</pub-id>
</mixed-citation>
</ref>
<ref id="b34"><mixed-citation publication-type="journal"><name><surname>Zhu</surname>
<given-names>C.</given-names>
</name>
 & <name><surname>Yu</surname>
<given-names>J.</given-names>
</name>
<article-title>Nonmetric multidimensional scaling corrects for population structure in association mapping with different sample types</article-title>
. <source>Genetics</source>
<volume>182</volume>
, <fpage>875</fpage>
–<lpage>888</lpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19414565</pub-id>
</mixed-citation>
</ref>
<ref id="b35"><mixed-citation publication-type="other"><name><surname>Tasoulis</surname>
<given-names>S.</given-names>
</name>
<italic>et al.</italic>
 in <italic>2014 IEEE International Conference on Big Data (Big Data)</italic>
 675–682 (Washington, DC, USA, 2014).</mixed-citation>
</ref>
<ref id="b36"><mixed-citation publication-type="journal"><name><surname>Cheng</surname>
<given-names>L.</given-names>
</name>
, <name><surname>Connor</surname>
<given-names>T. R.</given-names>
</name>
, <name><surname>Sirén</surname>
<given-names>J.</given-names>
</name>
, <name><surname>Aanensen</surname>
<given-names>D. M.</given-names>
</name>
 & <name><surname>Corander</surname>
<given-names>J.</given-names>
</name>
<article-title>Hierarchical and spatially explicit clustering of DNA sequences with BAPS software</article-title>
. <source>Mol. Biol. Evol.</source>
<volume>30</volume>
, <fpage>1224</fpage>
–<lpage>1228</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23408797</pub-id>
</mixed-citation>
</ref>
<ref id="b37"><mixed-citation publication-type="journal"><name><surname>Heinze</surname>
<given-names>G.</given-names>
</name>
 & <name><surname>Schemper</surname>
<given-names>M.</given-names>
</name>
<article-title>A solution to the problem of separation in logistic regression</article-title>
. <source>Stat. Med.</source>
<volume>21</volume>
, <fpage>2409</fpage>
–<lpage>2419</lpage>
 (<year>2002</year>
).<pub-id pub-id-type="pmid">12210625</pub-id>
</mixed-citation>
</ref>
<ref id="b38"><mixed-citation publication-type="journal"><name><surname>Ford</surname>
<given-names>C. B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis</article-title>
. <source>Nat. Genet.</source>
<volume>45</volume>
, <fpage>784</fpage>
–<lpage>790</lpage>
 (<year>2013</year>
).<pub-id pub-id-type="pmid">23749189</pub-id>
</mixed-citation>
</ref>
<ref id="b39"><mixed-citation publication-type="journal"><name><surname>Sanderson</surname>
<given-names>C.</given-names>
</name>
 & <name><surname>Curtin</surname>
<given-names>R.</given-names>
</name>
<article-title>Armadillo: a template-based C++ library for linear algebra</article-title>
. <source>JOSS</source>
<ext-link ext-link-type="uri" xlink:href="http://joss.theoj.org/papers/10.21105/joss.00026">http://joss.theoj.org/papers/10.21105/joss.00026</ext-link>
 (<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b40"><mixed-citation publication-type="journal"><name><surname>King</surname>
<given-names>D. E.</given-names>
</name>
<article-title>Dlib-ml: A Machine Learning Toolkit</article-title>
. <source>J. Mach. Learn. Res.</source>
<volume>10</volume>
, <fpage>1755</fpage>
–<lpage>1758</lpage>
 (<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b41"><mixed-citation publication-type="journal"><name><surname>Kent</surname>
<given-names>W. J.</given-names>
</name>
<article-title>BLAT—The BLAST-Like Alignment Tool</article-title>
. <source>Genome Res.</source>
<volume>12</volume>
, <fpage>656</fpage>
–<lpage>664</lpage>
 (<year>2002</year>
).<pub-id pub-id-type="pmid">11932250</pub-id>
</mixed-citation>
</ref>
<ref id="b42"><mixed-citation publication-type="journal"><name><surname>Li</surname>
<given-names>H.</given-names>
</name>
<article-title>A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data</article-title>
. <source>Bioinformatics</source>
<volume>27</volume>
, <fpage>2987</fpage>
–<lpage>2993</lpage>
 (<year>2011</year>
).<pub-id pub-id-type="pmid">21903627</pub-id>
</mixed-citation>
</ref>
<ref id="b43"><mixed-citation publication-type="journal"><name><surname>Cingolani</surname>
<given-names>P.</given-names>
</name>
<italic>et al.</italic>
<article-title>A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3</article-title>
. <source>Fly (Austin)</source>
<volume>6</volume>
, <fpage>1</fpage>
–<lpage>13</lpage>
 (<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b44"><mixed-citation publication-type="journal"><name><surname>Dalquen</surname>
<given-names>D. A.</given-names>
</name>
, <name><surname>Anisimova</surname>
<given-names>M.</given-names>
</name>
, <name><surname>Gonnet</surname>
<given-names>G. H.</given-names>
</name>
 & <name><surname>Dessimoz</surname>
<given-names>C.</given-names>
</name>
<article-title>ALF–a simulation framework for genome evolution</article-title>
. <source>Mol. Biol. Evol.</source>
<volume>29</volume>
, <fpage>1115</fpage>
–<lpage>1123</lpage>
 (<year>2012</year>
).<pub-id pub-id-type="pmid">22160766</pub-id>
</mixed-citation>
</ref>
<ref id="b45"><mixed-citation publication-type="journal"><name><surname>Chen</surname>
<given-names>J. Q.</given-names>
</name>
<italic>et al.</italic>
<article-title>Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria</article-title>
. <source>Mol. Biol. Evol.</source>
<volume>26</volume>
, <fpage>1523</fpage>
–<lpage>1531</lpage>
 (<year>2009</year>
).<pub-id pub-id-type="pmid">19329651</pub-id>
</mixed-citation>
</ref>
<ref id="b46"><mixed-citation publication-type="journal"><name><surname>Hu</surname>
<given-names>X.</given-names>
</name>
<italic>et al.</italic>
<article-title>pIRS: Profile-based Illumina pair-end reads simulator</article-title>
. <source>Bioinformatics</source>
<volume>28</volume>
, <fpage>1533</fpage>
–<lpage>1535</lpage>
 (<year>2012</year>
).<pub-id pub-id-type="pmid">22508794</pub-id>
</mixed-citation>
</ref>
<ref id="b47"><mixed-citation publication-type="journal"><name><surname>Cartwright</surname>
<given-names>R. A.</given-names>
</name>
<article-title>DNA assembly with gaps (Dawg): Simulating sequence evolution</article-title>
. <source>Bioinformatics</source>
<volume>21</volume>
, <fpage>31</fpage>
–<lpage>38</lpage>
 (<year>2005</year>
).<pub-id pub-id-type="pmid">15333453</pub-id>
</mixed-citation>
</ref>
<ref id="b48"><mixed-citation publication-type="journal"><name><surname>Kosiol</surname>
<given-names>C.</given-names>
</name>
, <name><surname>Holmes</surname>
<given-names>I.</given-names>
</name>
 & <name><surname>Goldman</surname>
<given-names>N.</given-names>
</name>
<article-title>An empirical codon model for protein sequence evolution</article-title>
. <source>Mol. Biol. Evol.</source>
<volume>24</volume>
, <fpage>1464</fpage>
–<lpage>1479</lpage>
 (<year>2007</year>
).<pub-id pub-id-type="pmid">17400572</pub-id>
</mixed-citation>
</ref>
<ref id="b49"><mixed-citation publication-type="journal"><name><surname>Newman</surname>
<given-names>S. C.</given-names>
</name>
 in <source>Biostatistical Methods in Epidemiology</source>
<fpage>329</fpage>
–<lpage>330</lpage>
John Wiley & Sons, Inc. (<year>2003</year>
).</mixed-citation>
</ref>
</ref-list>
<fn-group><fn><p><bold>Author contributions</bold>
 J.A.L.: designed methods, performed analysis and wrote the manuscript. M.V.: designed methods, performed analysis and wrote the manuscript. N.V.: participated in method design and edited the manuscript. S.R.H.: interpretation and preparation of <italic>S. pyogenes</italic>
 data. C.C.: prepared genetic data and metadata from Maela isolates. N.J.C.: helped with interpretation of antibiotic resistance elements and edited the manuscript. P.M.: participated in method design and edited the manuscript. A.H.: participated in method design and edited the manuscript. M.R.D.: analysis of <italic>S. pyogenes</italic>
 data and edited the manuscript. A.C.S.: collection of <italic>S. pyogenes</italic>
 isolates from Fiji and edited the manuscript. S.Y.C.T.: culturing and extraction of <italic>S. pyogenes</italic>
 isolates from Fiji and edited the manuscript. J.P.: advised on microbiological interpretation and edited the manuscript. S.D.B.: advised on microbiological interpretation and edited the manuscript. J.C.: designed method, performed analysis and wrote the manuscript.</p>
</fn>
</fn-group>
</back>
<floats-group><fig id="f1"><label>Figure 1</label>
<caption><title>Power to find associations versus number of samples.</title>
<p>Using simulations and subsamples of the population as described in the Methods, power is shown for (<bold>a</bold>
) detecting gene presence/absence at different odds ratios; (<bold>b</bold>
) using all informative k-mers versus k-mers of a single length; and (<bold>c</bold>
) detecting k-mers that are near the causal variant for trimethoprim resistance, in the correct gene, or contain the variant itself. All curves are logistic fits to the mean power over 100 subsamples.</p>
</caption>
<graphic xlink:href="ncomms12797-f1"></graphic>
</fig>
<fig id="f2"><label>Figure 2</label>
<caption><title>Fine mapping trimethoprim resistance.</title>
<p>The locus pictured contains 72 significant k-mers, the most of any gene cluster. Coverage over the locus is pictured at the bottom of the figure. Shown above the genes are high-quality missense SNPs, plotted using their <italic>P</italic>
 value for affecting protein function as predicted by SIFT. Scale bar is 200 base pairs.</p>
</caption>
<graphic xlink:href="ncomms12797-f2"></graphic>
</fig>
<table-wrap position="float" id="t1"><label>Table 1</label>
<caption><title>K-mers associated with antibiotic resistance.</title>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="left"></col>
<col align="left"></col>
</colgroup>
<thead valign="bottom"><tr><th rowspan="2" align="left" valign="top" charoff="50"><bold>Antibiotic</bold>
</th>
<th rowspan="2" align="center" valign="top" charoff="50"><bold>Resistant samples</bold>
</th>
<th colspan="4" align="center" valign="top" charoff="50"><bold>Number of significant k-mers</bold>
<hr></hr>
</th>
</tr>
<tr><th align="center" valign="top" charoff="50"><bold>Total</bold>
</th>
<th align="center" valign="top" charoff="50"><bold>Mapped to reference</bold>
</th>
<th align="left" valign="top" charoff="50"><bold>Highest coverage annotation</bold>
</th>
<th align="left" valign="top" charoff="50"><bold>Causal element</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="top" charoff="50">Chloramphenicol</td>
<td align="center" valign="top" charoff="50">204 (7%)</td>
<td align="center" valign="top" charoff="50">1,526</td>
<td align="center" valign="top" charoff="50">1,526</td>
<td align="left" valign="top" charoff="50">1,508—ICE</td>
<td align="left" valign="top" charoff="50">166—<italic>cat</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">288—ORF (UniParc B8ZK82)</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">206—<italic>rep</italic>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50"><bold>166—<italic>cat</italic>
</bold>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50">Erythromycin</td>
<td align="center" valign="top" charoff="50">803 (26%)</td>
<td align="center" valign="top" charoff="50">1,154</td>
<td align="center" valign="top" charoff="50">112</td>
<td align="left" valign="top" charoff="50">10—permease (UniParc B8ZKV5)</td>
<td align="left" valign="top" charoff="50">4—mega element</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">8—<italic>prfC</italic>
</td>
<td align="left" valign="top" charoff="50">2—<italic>mef</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">6—<italic>gatA</italic>
</td>
<td align="left" valign="top" charoff="50">2—omega element</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">4—ICE</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50">β-lactams</td>
<td align="center" valign="top" charoff="50">1,563 (51%)</td>
<td align="center" valign="top" charoff="50">23,876</td>
<td align="center" valign="top" charoff="50">17,453</td>
<td align="left" valign="top" charoff="50">381—ICE</td>
<td align="left" valign="top" charoff="50">47—<italic>pbp2x</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">145—prophage MM1</td>
<td align="left" valign="top" charoff="50">20—<italic>pbp2b</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">50—SPN23F15110 (UniParc B8ZLE7)</td>
<td align="left" valign="top" charoff="50">8—<italic>pbp1a</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">49—ICE <italic>orf16</italic>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50">Tetracycline</td>
<td align="center" valign="top" charoff="50">1,958 (64%)</td>
<td align="center" valign="top" charoff="50">962</td>
<td align="center" valign="top" charoff="50">962</td>
<td align="left" valign="top" charoff="50">962—ICE</td>
<td align="left" valign="top" charoff="50">96—<italic>tetM</italic>
</td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">136—ICE <italic>orf16</italic>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50">121—ICE <italic>orf15</italic>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="center" valign="top" charoff="50"> </td>
<td align="left" valign="top" charoff="50"><bold>96—<italic>tetM</italic>
</bold>
</td>
<td align="left" valign="top" charoff="50"> </td>
</tr>
<tr><td align="left" valign="top" charoff="50">Trimethoprim</td>
<td align="center" valign="top" charoff="50">2,553 (83%)</td>
<td align="center" valign="top" charoff="50">2,639</td>
<td align="center" valign="top" charoff="50">210</td>
<td align="left" valign="top" charoff="50"><bold>21—<italic>dyr</italic>
</bold>
</td>
<td align="left" valign="top" charoff="50">21—<italic>dyr</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="t1-fn1"><p>ICE, integrative conjugative element.</p>
</fn>
<fn id="t1-fn2"><p>Results from SEER for the binary antibiotic resistance outcome in a population of 3,069 <italic>S. pneumoniae</italic>
 isolates. Significant k-mers are first interpreted by mapping to the ATCC 700669 reference genome. Up to four of the highest-coverage annotations are shown; if the known mechanism is among them, it is highlighted in bold. The ICE is the top hit in three analyses, as it carries multiple drug-resistance elements and is commonly found in multidrug-resistant strains<xref ref-type="bibr" rid="b16">16</xref>
. The distribution of phenotype across the phylogeny is shown in <xref ref-type="supplementary-material" rid="S1">Supplementary Fig. 4</xref>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021
