Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF

Identifieur interne : 000146 ( Pmc/Corpus ); précédent : 000145; suivant : 000147

A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF

Auteurs : Yingnan Cong ; Yao-Ban Chan ; Mark A. Ragan

Source :

RBID : PMC:4958984

Abstract

Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of k-mers in the sequences. If a set of contiguous k-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and k size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.


Url:
DOI: 10.1038/srep30308
PubMed: 27453035
PubMed Central: 4958984

Links to Exploration step

PMC:4958984

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF</title>
<author>
<name sortKey="Cong, Yingnan" sort="Cong, Yingnan" uniqKey="Cong Y" first="Yingnan" last="Cong">Yingnan Cong</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, St Lucia, Brisbane, QLD 4072,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chan, Yao Ban" sort="Chan, Yao Ban" uniqKey="Chan Y" first="Yao-Ban" last="Chan">Yao-Ban Chan</name>
<affiliation>
<nlm:aff id="a2">
<institution>School of Mathematics and Statistics, The University of Melbourne</institution>
, Parkville, Melbourne, VIC 3010,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ragan, Mark A" sort="Ragan, Mark A" uniqKey="Ragan M" first="Mark A." last="Ragan">Mark A. Ragan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, St Lucia, Brisbane, QLD 4072,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27453035</idno>
<idno type="pmc">4958984</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4958984</idno>
<idno type="RBID">PMC:4958984</idno>
<idno type="doi">10.1038/srep30308</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000146</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000146</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF</title>
<author>
<name sortKey="Cong, Yingnan" sort="Cong, Yingnan" uniqKey="Cong Y" first="Yingnan" last="Cong">Yingnan Cong</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, St Lucia, Brisbane, QLD 4072,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chan, Yao Ban" sort="Chan, Yao Ban" uniqKey="Chan Y" first="Yao-Ban" last="Chan">Yao-Ban Chan</name>
<affiliation>
<nlm:aff id="a2">
<institution>School of Mathematics and Statistics, The University of Melbourne</institution>
, Parkville, Melbourne, VIC 3010,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ragan, Mark A" sort="Ragan, Mark A" uniqKey="Ragan M" first="Mark A." last="Ragan">Mark A. Ragan</name>
<affiliation>
<nlm:aff id="a1">
<institution>Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, St Lucia, Brisbane, QLD 4072,
<country>Australia</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of
<italic>k</italic>
-mers in the sequences. If a set of contiguous
<italic>k</italic>
-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and
<italic>k</italic>
size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Ochman, H" uniqKey="Ochman H">H. Ochman</name>
</author>
<author>
<name sortKey="Lawrence, J G" uniqKey="Lawrence J">J. G. Lawrence</name>
</author>
<author>
<name sortKey="Groisman, E A" uniqKey="Groisman E">E. A. Groisman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmitt, R M" uniqKey="Schmitt R">R. M. Schmitt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davies, J" uniqKey="Davies J">J. Davies</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doolittle, W F" uniqKey="Doolittle W">W. F. Doolittle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martin, W" uniqKey="Martin W">W. Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beiko, R G" uniqKey="Beiko R">R. G. Beiko</name>
</author>
<author>
<name sortKey="Harlow, T J" uniqKey="Harlow T">T. J. Harlow</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raymond, J" uniqKey="Raymond J">J. Raymond</name>
</author>
<author>
<name sortKey="Siefert, J L" uniqKey="Siefert J">J. L. Siefert</name>
</author>
<author>
<name sortKey="Staples, C R" uniqKey="Staples C">C. R. Staples</name>
</author>
<author>
<name sortKey="Blankenship, R E" uniqKey="Blankenship R">R. E. Blankenship</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thomas, C M" uniqKey="Thomas C">C. M. Thomas</name>
</author>
<author>
<name sortKey="Nielsen, K M" uniqKey="Nielsen K">K. M. Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Skippington, E" uniqKey="Skippington E">E. Skippington</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, C X" uniqKey="Chan C">C. X. Chan</name>
</author>
<author>
<name sortKey="Darling, A E" uniqKey="Darling A">A. E. Darling</name>
</author>
<author>
<name sortKey="Beiko, R G" uniqKey="Beiko R">R. G. Beiko</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
<author>
<name sortKey="Beiko, R G" uniqKey="Beiko R">R. G. Beiko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lawrence, J G" uniqKey="Lawrence J">J. G. Lawrence</name>
</author>
<author>
<name sortKey="Ochman, H" uniqKey="Ochman H">H. Ochman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lawrence, J G" uniqKey="Lawrence J">J. G. Lawrence</name>
</author>
<author>
<name sortKey="Ochman, H" uniqKey="Ochman H">H. Ochman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Domazet Loso, M" uniqKey="Domazet Loso M">M. Domazet-Lošo</name>
</author>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Domazet Loso, M" uniqKey="Domazet Loso M">M. Domazet-Lošo</name>
</author>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Domazet Loso, M" uniqKey="Domazet Loso M">M. Domazet-Lošo</name>
</author>
<author>
<name sortKey="Haubold, B" uniqKey="Haubold B">B. Haubold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saitou, N" uniqKey="Saitou N">N. Saitou</name>
</author>
<author>
<name sortKey="Nei, M" uniqKey="Nei M">M. Nei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taniguchi, Y" uniqKey="Taniguchi Y">Y. Taniguchi</name>
</author>
<author>
<name sortKey="Yamada, Y" uniqKey="Yamada Y">Y. Yamada</name>
</author>
<author>
<name sortKey="Maruyama, O" uniqKey="Maruyama O">O. Maruyama</name>
</author>
<author>
<name sortKey="Kuhara, S" uniqKey="Kuhara S">S. Kuhara</name>
</author>
<author>
<name sortKey="Ikeda, D" uniqKey="Ikeda D">D. Ikeda</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gusfield, D" uniqKey="Gusfield D">D. Gusfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luhn, H P" uniqKey="Luhn H">H. P. Luhn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, K S" uniqKey="Jones K">K. S. Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
<author>
<name sortKey="Buckley, C" uniqKey="Buckley C">C. Buckley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, H C" uniqKey="Wu H">H. C. Wu</name>
</author>
<author>
<name sortKey="Luk, R W P" uniqKey="Luk R">R. W. P. Luk</name>
</author>
<author>
<name sortKey="Wong, K F" uniqKey="Wong K">K. F. Wong</name>
</author>
<author>
<name sortKey="Kwok, K L" uniqKey="Kwok K">K. L. Kwok</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holden, M T" uniqKey="Holden M">M. T. Holden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hasegawa, M" uniqKey="Hasegawa M">M. Hasegawa</name>
</author>
<author>
<name sortKey="Kishino, H" uniqKey="Kishino H">H. Kishino</name>
</author>
<author>
<name sortKey="Yano, T" uniqKey="Yano T">T. Yano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J. Felsenstein</name>
</author>
<author>
<name sortKey="Churchill, G A" uniqKey="Churchill G">G. A. Churchill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cong, Y" uniqKey="Cong Y">Y. Cong</name>
</author>
<author>
<name sortKey="Chan, Y B" uniqKey="Chan Y">Y.-b. Chan</name>
</author>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Popa, O" uniqKey="Popa O">O. Popa</name>
</author>
<author>
<name sortKey="Hazkani Covo, E" uniqKey="Hazkani Covo E">E. Hazkani-Covo</name>
</author>
<author>
<name sortKey="Landan, G" uniqKey="Landan G">G. Landan</name>
</author>
<author>
<name sortKey="Martin, W" uniqKey="Martin W">W. Martin</name>
</author>
<author>
<name sortKey="Dagan, T" uniqKey="Dagan T">T. Dagan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jain, R" uniqKey="Jain R">R. Jain</name>
</author>
<author>
<name sortKey="Rivera, M C" uniqKey="Rivera M">M. C. Rivera</name>
</author>
<author>
<name sortKey="Lake, J A" uniqKey="Lake J">J. A. Lake</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robinson, D A" uniqKey="Robinson D">D. A. Robinson</name>
</author>
<author>
<name sortKey="Enright, M C" uniqKey="Enright M">M. C. Enright</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
<author>
<name sortKey="Mcgill, M J" uniqKey="Mcgill M">M. J. McGill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
<author>
<name sortKey="Mcgill, M J" uniqKey="Mcgill M">M. J. McGill</name>
</author>
<author>
<name sortKey="Sparck Jones, K" uniqKey="Sparck Jones K">K Sparck Jones</name>
</author>
<author>
<name sortKey="Willett, P" uniqKey="Willett P">P Willett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Joachims, T" uniqKey="Joachims T">T. Joachims</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zobel, J" uniqKey="Zobel J">J. Zobel</name>
</author>
<author>
<name sortKey="Moffat, A" uniqKey="Moffat A">A. Moffat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paltoglou, G" uniqKey="Paltoglou G">G. Paltoglou</name>
</author>
<author>
<name sortKey="Thelwall, M" uniqKey="Thelwall M">M. Thelwall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
<author>
<name sortKey="Yang, C S" uniqKey="Yang C">C.-S. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G. Salton</name>
</author>
<author>
<name sortKey="Yang, C S" uniqKey="Yang C">C.-S. Yang</name>
</author>
<author>
<name sortKey="Yu, C T" uniqKey="Yu C">C. T. Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nussinov, R" uniqKey="Nussinov R">R. Nussinov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koonin, E V" uniqKey="Koonin E">E. V. Koonin</name>
</author>
<author>
<name sortKey="Galperin, M Y" uniqKey="Galperin M">M. Y. Galperin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kruskal, J B" uniqKey="Kruskal J">J. B. Kruskal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C. Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Greenfield, P" uniqKey="Greenfield P">P. Greenfield</name>
</author>
<author>
<name sortKey="Duesing, K" uniqKey="Duesing K">K. Duesing</name>
</author>
<author>
<name sortKey="Papanicolaou, A" uniqKey="Papanicolaou A">A. Papanicolaou</name>
</author>
<author>
<name sortKey="Bauer, D C" uniqKey="Bauer D">D. C. Bauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chor, B" uniqKey="Chor B">B. Chor</name>
</author>
<author>
<name sortKey="Horn, D" uniqKey="Horn D">D. Horn</name>
</author>
<author>
<name sortKey="Goldman, N" uniqKey="Goldman N">N. Goldman</name>
</author>
<author>
<name sortKey="Levy, Y" uniqKey="Levy Y">Y. Levy</name>
</author>
<author>
<name sortKey="Massingham, T" uniqKey="Massingham T">T. Massingham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Burden, C J" uniqKey="Burden C">C. J. Burden</name>
</author>
<author>
<name sortKey="Leopardi, P" uniqKey="Leopardi P">P. Leopardi</name>
</author>
<author>
<name sortKey="Foret, S" uniqKey="Foret S">S. Foret</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S. Kurtz</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A. Narechania</name>
</author>
<author>
<name sortKey="Stein, J C" uniqKey="Stein J">J. C. Stein</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D. Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mantegna, R N" uniqKey="Mantegna R">R. N. Mantegna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsonis, A A" uniqKey="Tsonis A">A. A. Tsonis</name>
</author>
<author>
<name sortKey="Elsner, J B" uniqKey="Elsner J">J. B. Elsner</name>
</author>
<author>
<name sortKey="Tsonis, P A" uniqKey="Tsonis P">P. A. Tsonis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ragan, M A" uniqKey="Ragan M">M. A. Ragan</name>
</author>
<author>
<name sortKey="Lee, A R" uniqKey="Lee A">A. R. Lee</name>
</author>
<author>
<name sortKey="Dudley, T R" uniqKey="Dudley T">T. R. Dudley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Felsenstein, J" uniqKey="Felsenstein J">J. Felsenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guibas, L J" uniqKey="Guibas L">L. J. Guibas</name>
</author>
<author>
<name sortKey="Sedgewick, R" uniqKey="Sedgewick R">R. Sedgewick</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dalquen, D A" uniqKey="Dalquen D">D. A. Dalquen</name>
</author>
<author>
<name sortKey="Anisimova, M" uniqKey="Anisimova M">M. Anisimova</name>
</author>
<author>
<name sortKey="Gonnet, G H" uniqKey="Gonnet G">G. H. Gonnet</name>
</author>
<author>
<name sortKey="Dessimoz, C" uniqKey="Dessimoz C">C. Dessimoz</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group>
<journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher>
<publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27453035</article-id>
<article-id pub-id-type="pmc">4958984</article-id>
<article-id pub-id-type="pii">srep30308</article-id>
<article-id pub-id-type="doi">10.1038/srep30308</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Cong</surname>
<given-names>Yingnan</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chan</surname>
<given-names>Yao-ban</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ragan</surname>
<given-names>Mark A.</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<aff id="a1">
<label>1</label>
<institution>Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland</institution>
, St Lucia, Brisbane, QLD 4072,
<country>Australia</country>
</aff>
<aff id="a2">
<label>2</label>
<institution>School of Mathematics and Statistics, The University of Melbourne</institution>
, Parkville, Melbourne, VIC 3010,
<country>Australia</country>
</aff>
</contrib-group>
<author-notes>
<corresp id="c1">
<label>a</label>
<email>m.ragan@uq.edu.au</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>07</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>6</volume>
<elocation-id>30308</elocation-id>
<history>
<date date-type="received">
<day>06</day>
<month>04</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>07</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2016, The Author(s)</copyright-statement>
<copyright-year>2016</copyright-year>
<copyright-holder>The Author(s)</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<pmc-comment>author-paid</pmc-comment>
<license-p>This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract>
<p>Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of
<italic>k</italic>
-mers in the sequences. If a set of contiguous
<italic>k</italic>
-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and
<italic>k</italic>
size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.</p>
</abstract>
</article-meta>
</front>
<body>
<p>Many microbes can acquire DNA from their environment and incorporate it into their genome
<italic>via</italic>
processes of lateral genetic transfer (LGT; also known as horizontal gene transfer, HGT)
<xref ref-type="bibr" rid="b1">1</xref>
. Circumstantial evidence for LGT was first reported more than a century ago
<xref ref-type="bibr" rid="b2">2</xref>
, and the phenomenon gained widespread attention in the 1950s with the emergence and spread of multi-drug resistance in bacteria
<xref ref-type="bibr" rid="b3">3</xref>
. With the uptake of genome sequencing over the last two decades, it has become increasingly clear that LGT plays a central role in the evolution of microbial genomes
<xref ref-type="bibr" rid="b1">1</xref>
<xref ref-type="bibr" rid="b4">4</xref>
<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b6">6</xref>
. LGT not only contributes to the spread of antibiotic resistance, but is also responsible for a range of metabolic innovations involving carbon and nitrogen metabolism, ion transport and other core processes
<xref ref-type="bibr" rid="b7">7</xref>
, which in turn can define microbial physiology and thus ecosystem function.</p>
<p>The recognised mechanisms of LGT (transformation, transduction and conjugation) can introduce exogenous regions of very different lengths, from short fragments to large chromosomal blocks
<xref ref-type="bibr" rid="b8">8</xref>
. Recombination need not be constrained by gene boundaries
<xref ref-type="bibr" rid="b9">9</xref>
, and there is little evidence to suggest that entire genes, or structurally based regions within genes, are privileged units of transfer
<xref ref-type="bibr" rid="b10">10</xref>
<xref ref-type="bibr" rid="b11">11</xref>
. In any event, genomic regions of lateral origin can be overwritten, wholly or in part, by subsequent LGT events. Thus microbial genomes can become mosaics, with regions of different lengths reflecting the history of LGT events, transfer mechanisms and donors in each lineage. Further, over time, sequence regions of lateral origin will evolve to become indistinguishable from the non-lateral background, a process known as amelioration
<xref ref-type="bibr" rid="b12">12</xref>
.</p>
<p>This complex biology presents challenges for the detection and delineation of genomic regions of lateral origin. As typically applied, approaches based on the topological comparison of inferred phylogenetic trees implicitly take genes (gene families) as the unit of analysis. Extensions that test for recombination breakpoints are computationally intensive, yet fail to identify the specific lineage(s) affected by transfer and/or subsequent overwriting. Directionality of transfer can also be difficult or impossible to determine by any phylogenetic approach. More broadly, computational methods are differentially sensitive to the extent of amelioration
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b14">14</xref>
. Considerable scope thus remains for the development of new methods that are sensitive, directional, scalable, informative on individual genomes or lineages, and do not require the units of analysis to be delineated
<italic>a priori</italic>
.</p>
<p>Alignment-free approaches to detect LGT at genome level have been developed in recent years. ALFY (ALignment-Free local homologY)
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b16">16</xref>
uses
<italic>Kr</italic>
<xref ref-type="bibr" rid="b17">17</xref>
based on
<italic>shustrings</italic>
(SHortest Unique subSTRINGS) to calculate pairwise evolutionary distances between genomes, which can then serve as input into a neighbor-joining algorithm
<xref ref-type="bibr" rid="b18">18</xref>
to compute a phylogenetic tree. Then ALFY compares the generated tree with a reference, inferring topological incongruence as instances of LGT.</p>
<p>Another alignment-free method for LGT detection is based on the so-called purity measure
<xref ref-type="bibr" rid="b19">19</xref>
. This is a concept from text mining, and is used to detect unusual regions of a string without recourse to domain knowledge. If most substrings of string
<italic>x</italic>
, which is itself a substring of string
<italic>T</italic>
, appear with the same frequency as
<italic>x</italic>
, then the purity value of
<italic>x</italic>
is high,
<italic>i.e.</italic>
subpatterns in
<italic>x</italic>
occur infrequently in
<italic>T</italic>
outside whole occurrences of
<italic>x</italic>
, as would be expected if
<italic>x</italic>
had arisen by LGT. Both of these alignment-free methods use suffix trees
<xref ref-type="bibr" rid="b20">20</xref>
for scalability on large sequence datasets. However, they consider only one target sequence (although ALFY incorporates a pairwise comparison between query and multiple subject sequences) and do not take into account any natural group structure of the dataset, whether taxonomic (a hierarchy of species, genera etc.), ecological or otherwise.</p>
<p>In this paper, we propose a novel alignment-free method for LGT detection based on concepts from TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF is a numerical statistic from document analysis that reflects the importance of a word (term) to a document within a collection or corpus, by comparing the frequency of a word in a document with its occurrence in other documents.</p>
<p>Term frequency (TF) is used to indicate the topic of a document
<xref ref-type="bibr" rid="b21">21</xref>
. The TF of term
<italic>t</italic>
in document
<italic>D</italic>
is simply the raw frequency of
<italic>t</italic>
in
<italic>D</italic>
, denoted by
<italic>tf(t, D)</italic>
. The inverse document frequency (IDF)
<xref ref-type="bibr" rid="b22">22</xref>
is used to distinguish a word from the prevalent vocabulary in the corpus. If
<italic>t</italic>
appears in
<italic>D</italic>
<sub>
<italic>t</italic>
</sub>
articles, then its IDF is
<italic>idf(t)</italic>
 = 
<italic>D</italic>
<sup>
<italic>*</italic>
</sup>
/
<italic>D</italic>
<sub>
<italic>t</italic>
</sub>
, where
<italic>D</italic>
<sup>
<italic>*</italic>
</sup>
is the number of all documents in the corpus. Thus a high IDF indicates that the term appears infrequently, and as such carries more importance for a specific article. Salton and Buckley combined the TF and IDF statistics into a single statistic that is widely used as a weighting factor in text mining and information retrieval
<xref ref-type="bibr" rid="b22">22</xref>
<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b24">24</xref>
.</p>
<p>Here we apply concepts from TF-IDF to develop an algorithm to detect LGT events in microbial genomes. Using simulated datasets, we test this algorithm and compare its performance with ALFY on sets of sequences of different length, from the size of a single gene (1000 nucleotides) up to 300-fold longer, and evaluate its performance over
<italic>k</italic>
-mer length and a biologically relevant range of values for parameters including substitution rate between groups, within groups and post-LGT. We find that with appropriate parameter values, the algorithm performs with good precision and recall; furthermore, runtime increases approximately linearly with sequence length, and in most cases TF-IDF performs much better than ALFY
<xref ref-type="bibr" rid="b15">15</xref>
. We also apply this method to an empirical dataset composed of seven
<italic>Staphylococcus aureus</italic>
genomes, and recover putative regions of lateral origin that correspond to genes involved in transport, antibiotic resistance, pathogenicity and virulence. Our results are comparable with those found with ALFY, and include two genomic regions independently confirmed by Holden
<italic>et al</italic>
.
<xref ref-type="bibr" rid="b25">25</xref>
.</p>
<sec disp-level="1">
<title>Results</title>
<sec disp-level="2">
<title>Performance with different parameter values</title>
<p>As described in Methods, we varied branch length at three stages of the simulation process (variation between groups, variation within groups, and variation post-LGT) and examined the effect on precision and recall. The results are shown in
<xref ref-type="fig" rid="f1">Figs 1</xref>
,
<xref ref-type="fig" rid="f2">2</xref>
,
<xref ref-type="fig" rid="f3">3</xref>
,
<xref ref-type="fig" rid="f4">4</xref>
for simulations under the HYK85
<xref ref-type="bibr" rid="b26">26</xref>
model of sequence change; the corresponding plots for F84
<xref ref-type="bibr" rid="b27">27</xref>
are in the
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
. Since TF-IDF does not detect LGT between sequences within a group, for the comparison we ignore such regions that are detected by ALFY; and if an atypical region is equally predicted in several sequences of potential donor groups, we treat this result as a single prediction for the calculation of precision and recall.</p>
<p>
<xref ref-type="fig" rid="f1">Figure 1</xref>
shows that when variation between groups is less than 0.05, the average distance accumulated between groups is less than 15%; at this degree of between-group similarity, the precision of our TF-IDF method is low (less than 50%) because the high similarity makes lateral regions harder to distinguish in the recipient group. Precision increases to a high level when variation between groups is above 0.1. Recall is high throughout (approximately 90%) and is less affected by variation; however, at the shortest sequence length examined here (1000), some simulated LGT segments are less than 50 nt in length, too short to contain enough information to make them distinct. As a consequence, recall is significantly lower for this sequence length only.</p>
<p>The precision of ALFY is low, around 0.35, and stable across all branch lengths, but its recall is high. There are two reasons for this. Firstly, ALFY cannot infer the direction of transfer, and may correctly predict one transfer from donor to recipient, but then (erroneously) predict it again from recipient to donor, effectively halving its precision. In the accompanying article
<xref ref-type="bibr" rid="b28">28</xref>
we compare TF-IDF with another directional LGT inference approach
<xref ref-type="bibr" rid="b29">29</xref>
applied to genome-scale empirical data. Secondly, ALFY predicts all most-similar regions as lateral transfers without using a threshold to determine if the similarity is significant or not. As such, it is apparent that ALFY is a useful tool for determining areas which should be further studied for transferred segments, but as a stand-alone detector of LGT it is inferior to TF-IDF. For sequences of length 1000 nt, ALFY’s default sliding window size is too large, leading to reduced performance.</p>
<p>
<xref ref-type="fig" rid="f2">Figure 2</xref>
shows the effect of variation within groups on precision and recall. Here, the precision of TF-IDF increases as variation increases. As above, the sequences must be sufficiently dissimilar for the TF statistic to support a decision of LGT. Recall is high, and stable when the sequence length is ≥3000 nt. Again, at sequence length 1000, some short LGT events (<50 nt) are ignored, resulting in decreased recall. The precision of ALFY is stable for variation above 0.005, but again low. TF-IDF shows greater stability and better performance than ALFY in almost all cases, and increasingly outperforms it as the variation increases. As in
<xref ref-type="fig" rid="f1">Fig. 1</xref>
, ALFY displays better recall than TF-IDF at sequence lengths greater than 1000 nt, but the gap is not large. When the variation within groups is low and the sequence length is short (1000 nt), ALFY again fails to detect most LGT events, leading to extremely low recall (see
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
).</p>
<p>
<xref ref-type="fig" rid="f3">Figure 3</xref>
shows the performance of TF-IDF against variation post-LGT and deletion rate for sequences of length 300,000 nt. Plots for other sequence lengths are similar in nature and can be found in the
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
. As variation increases, both precision and (especially) recall decrease substantially, as substitutions progressively obscure the regions of lateral origin. When the branch length post-LGT reaches 0.05 (
<italic>i.e.</italic>
one nucleotide in ten is expected to have changed, as this is a two-level tree), almost all
<italic>k</italic>
-mers (for
<italic>k</italic>
 = 40) have been changed, whether in lateral regions or not. In this case, all
<italic>k</italic>
-mer based methods, including TF-IDF, will fail (and indeed, even alignment-based methods will struggle).</p>
<p>As the amount of deletion increases, precision remains stable and recall decreases slightly. Deletion can move an LGT segment within a sequence, or delete part (or parts) of it. Moving an LGT region does not change its
<italic>k</italic>
-mers, so this will not affect the performance of TF-IDF. Deletions within a lateral region affect only the immediately adjacent
<italic>k</italic>
-mers, with little effect on precision unless the region becomes so fragmented that
<italic>k</italic>
-mer counts are reduced to the point where they are ignored by TF-IDF, degrading the recall.</p>
<p>Precision and recall increase slightly with sequence length, but length does not appear to interact substantially with the substitution-rate parameters. Since there is no interaction between variation post-LGT and deletion (
<xref ref-type="fig" rid="f3">Fig. 3</xref>
), we can fix one of these parameters and vary the other.
<xref ref-type="fig" rid="f4">Figure 4</xref>
shows that for TF-IDF and ALFY, both precision and recall decrease as variation post-LGT increases. The precision of ALFY is worse than that of TF-IDF, but its recall is higher and more stable. When deletion is varied (
<xref ref-type="fig" rid="f5">Fig. 5</xref>
), precision is stable except at sequence length 1000, while recall decreases slightly for TF-IDF. As before, TF-IDF is more precise than ALFY, whereas ALFY exhibits higher recall (except at sequence length 1000).</p>
</sec>
<sec disp-level="2">
<title>
<italic>k</italic>
-mer size</title>
<p>
<italic>k</italic>
-mer size also affects the performance of TF-IDF. As shown in
<xref ref-type="fig" rid="f6">Fig. 6</xref>
, precision increases with
<italic>k</italic>
, but recall decreases. This effect is roughly consistent for every sequence length we examined. The two plots indicate that in this simulation, precision and recall are best balanced at
<italic>k</italic>
 = 40. Indeed, in our experience (as shown and unpublished)
<italic>k</italic>
 = 40 is a useful default setting, in the absence of conditions that argue otherwise. However, if LGT is sufficiently obscured by substitution such that nearly all
<italic>k</italic>
-mers are unique, TF-IDF will not be able to find sets of
<italic>k</italic>
-mers that appear frequently in distant groups, and no LGT will be predicted. In such cases, shorter
<italic>k</italic>
may give better performance. Note that larger
<italic>k</italic>
imposes a greater memory cost, and more computational time is spent indexing unique
<italic>k</italic>
-mers.</p>
</sec>
<sec disp-level="2">
<title>Computation time</title>
<p>
<xref ref-type="fig" rid="f7">Figure 7</xref>
compares computation time (walltime) for various sequence lengths
<italic>L</italic>
for ALFY and TF-IDF. All experiments were done on a virtual machine with a single AMD Opteron 2.3-GHz processor and 256 GB memory. As noted below, TF-IDF is expected to scale as
<italic>O</italic>
(
<italic>nL log U</italic>
), where
<italic>U</italic>
is the number of unique
<italic>k</italic>
-mers in the dataset.
<italic>U</italic>
is highly dependent on variation at all levels of the simulation, which also leads to variation of time consumption in each experiment; if the final sequences are sufficiently dissimilar, we expect
<italic>U</italic>
to increase as the number of possible
<italic>k</italic>
-mers in the dataset,
<italic>i.e.</italic>
as
<italic>nL</italic>
. Thus, we expect the time to have an
<italic>O</italic>
(
<italic>L log L</italic>
) dependence on
<italic>L</italic>
, and this is verified in
<xref ref-type="fig" rid="f7">Fig. 7</xref>
; the slope of the fitted line is 1.07. For ALFY, the time consumption is
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
<italic>L</italic>
) for detecting LGTs between all sequences in a dataset. In a dataset with tens of sequences or more, ALFY will take much longer than TF-IDF, and this is shown in
<xref ref-type="fig" rid="f7">Fig. 7</xref>
.</p>
<p>
<xref ref-type="fig" rid="f8">Figure 8</xref>
shows how walltime depends on
<italic>U</italic>
. As above, we expect time divided by
<italic>L</italic>
to have a linear relationship with
<italic>log U</italic>
, and this is clearly shown.</p>
</sec>
<sec disp-level="2">
<title>Analysis of an empirical dataset</title>
<p>We also tested our algorithm on an empirical dataset that had previously been examined by the developers of ALFY
<xref ref-type="bibr" rid="b15">15</xref>
. We used a subset of their dataset, seven genomes of
<italic>Staphylococcus aureus</italic>
, because this dataset contains strong group information (six genomes from Clonal Complex 8 (CC8) and one multi-drug resistant strain from CC30,
<italic>S. aureus</italic>
MRSA252) and showed LGT in their analysis. We investigate potential LGT into
<italic>S. aureus</italic>
TW20, a member of CC8, from MRSA252.</p>
<p>Setting
<italic>k</italic>
 = 40, we identify 1421 regions of TW20 as of lateral origin. Many of these are short and, in this simple example (where the donor group is of size 1, reducing the efficacy of the IDF component) potentially due to noise; but 173 are of length ≥2000, 52 of ≥4000 and 20 of ≥6000 nt (
<xref ref-type="table" rid="t1">Table 1</xref>
). It is unclear how to optimise selection of the length threshold, but setting it at ≥2000 nt we infer as lateral 35.6% of the genome, which incorporates 67% (4/6) of the TW20 penicillin-binding genes, and ≥50% (
<italic>i.e.</italic>
>1.5-fold over-representation) of the annotated genes encoding efflux proteins (2/4), metalloproteinases and -peptidases (3/3), permeases (31/45) and uptake proteins (2/4), types of functions known to be mobilised by LGT
<xref ref-type="bibr" rid="b11">11</xref>
<xref ref-type="bibr" rid="b30">30</xref>
. For details see
<xref ref-type="supplementary-material" rid="S1">Supplementary Table S1</xref>
. By contrast, hypothetical proteins, which might be expected to show no bias for or against lateral origin, are not enriched at any of the length thresholds mentioned above. Ribosomal proteins, which are not expected to be lateral (Jain
<italic>et al</italic>
.
<xref ref-type="bibr" rid="b30">30</xref>
), are rarely represented in our lateral regions (8/60). Phage proteins are not represented in our detected lateral regions; recalling that our approach can discover LGT only
<italic>within</italic>
the dataset, these results might accurately reflect the history of genetic relationships among these seven genomes. Scope remains for further analysis with other empirical data, and with different settings for
<italic>k</italic>
and gap size.</p>
<p>Both our TF-IDF method and ALFY identify most of the genomic region from 2.80–0.42 Mb (TF-IDF) or 2.8–0.5 Mb (ALFY) as lateral (
<xref ref-type="fig" rid="f9">Fig. 9</xref>
); this region includes two transposons, SCC elements and genes encoding methicillin and penicillin resistance. Robinson and Enright
<xref ref-type="bibr" rid="b31">31</xref>
hypothesised that the methicillin resistance, at least, had been transferred from CC30 into a CC8 background as part of a large chromosomal replacement. The region from 1.75–1.80 Mb includes the transposon Tn
<italic>554</italic>
<xref ref-type="bibr" rid="b25">25</xref>
, which encodes resistance to erythromycin and spectinomycin. A region from 2.11–2.15 Mb incorporating a number of annotated phage genes was likewise identified. Regions identified as lateral by TF-IDF but not by ALFY include 1.06–1.17 Mb (transport protein genes) and 2.64–2.65 Mb (a transporter and a member of the TetR family of regulatory proteins, which control the expression of genes involved in multidrug resistance and pathogenicity).</p>
</sec>
</sec>
<sec disp-level="1">
<title>Discussion and Conclusion</title>
<p>We have developed a fast alignment-free method to infer LGT events. Our method is based on TF-IDF, one of the most important methods used in information retrieval. TF-IDF has been widely applied in search engines, document classification and related applications including relevance decision-making. Here we apply TF-IDF to sequence analysis for the first time, treating a sequence or genome as an article and each
<italic>k</italic>
-mer as a word. Using simulated datasets, we show that TF-IDF can effectively find LGT events with good precision and recall, outperforming ALFY in most biologically realistic situations. We also analyse an empirical dataset and show that TF-IDF finds essentially all regions identified by ALFY as of lateral origin. TF-IDF further detects other regions that, based on annotated gene content, may also have arisen
<italic>via</italic>
LGT. Our method is alignment-free and scales very well in both length and number of sequences,
<italic>i.e.</italic>
to many entire genomes. It is worth noting that in each simulated dataset, all sequences share the same length and group size. For the empirical dataset, the group sizes and lengths of the seven
<italic>S. aureus</italic>
are of the same magnitude. For this reason, we did not normalise the count of
<italic>k</italic>
-mers in the IDF step. However, in other empirical datasets the sequence length and group size may vary greatly, and normalisation might be considered
<xref ref-type="bibr" rid="b28">28</xref>
.</p>
<p>Our method is purely data-driven, its performance relying strongly on sequence and group information in the dataset. In our simulations, when sequences are relatively similar within-group (variation 0.005–0.02) and relatively dissimilar between-group (variation >0.1), group boundaries are clear, and the precision and recall of our algorithm is high. When speciation is modest (<0.05), within-group divergence high (<0.1) or LGT events obscured by subsequent evolution (>0.02), TF-IDF loses precision in inferring LGT events.</p>
<p>In the accompanying article
<xref ref-type="bibr" rid="b28">28</xref>
we apply this method to larger empirical datasets. TF-IDF could further be applied to environmental data, e.g. to study the flow of genetic material in communities and across the biosphere. We anticipate that significant scope remains for further algorithmic and implementational improvements.</p>
</sec>
<sec disp-level="1">
<title>Methods</title>
<sec disp-level="2">
<title>Notation</title>
<p>Here we establish some notation. We start with a dataset of
<italic>n</italic>
sequences, each of length
<italic>L</italic>
. For empirical datasets (and for some approaches to simulation) the length may vary among sequences; in those cases we use
<italic>L</italic>
to denote the average length. The sequences in the dataset are divided into
<italic>m</italic>
groups corresponding to closely related genomes (
<italic>e.g.</italic>
belonging to the same clonal group, species or genus). We denote each sequence as
<italic>S</italic>
<sub>
<italic>i,j</italic>
</sub>
, where
<italic>i</italic>
 = 1, 2, …,
<italic>n</italic>
is the number of the sequence in the dataset and
<italic>j</italic>
 = 1, 2, …,
<italic>m</italic>
is the number of the group to which the sequence belongs. The number of sequences in group
<italic>j</italic>
is denoted by
<italic>h</italic>
<sub>
<italic>j</italic>
</sub>
.</p>
<p>Our method proceeds by comparing substrings (words) of a fixed length
<italic>k</italic>
, called
<italic>k</italic>
-mers. We encode each sequence as a frequency vector of
<italic>k</italic>
-mers, counting only those
<italic>k</italic>
-mers that actually appear in the sequence, and denoting the number of unique
<italic>k</italic>
-mers appearing in the dataset by
<italic>U</italic>
. In general,
<italic>U</italic>
is much smaller than 4
<sup>
<italic>k</italic>
</sup>
, the total number of all possible
<italic>k</italic>
-mers.</p>
<p>Although we illustrate our approach here using nucleotide sequences, the method is easily adapted for amino acids, requiring only a change of alphabet.</p>
</sec>
<sec disp-level="2">
<title>TF-IDF on texts</title>
<p>As mentioned above, TF-IDF was introduced to indicate the topic of a document, and distinguish that document from others in the same corpus for a specific query. The classical usage of TF-IDF is as a smart retrieval system and for automatic document categorisation
<xref ref-type="bibr" rid="b32">32</xref>
<xref ref-type="bibr" rid="b33">33</xref>
<xref ref-type="bibr" rid="b34">34</xref>
. A variant uses prototype vectors to calculate relevance between documents with a nearest-neighbor learning method
<xref ref-type="bibr" rid="b35">35</xref>
. PrTFIDF
<xref ref-type="bibr" rid="b36">36</xref>
is an improved version of TF-IDF founded on a probabilistic model for text categorization, and there are other variants for calculating TF-IDF
<xref ref-type="bibr" rid="b37">37</xref>
. In recent years, TF-IDF has also been applied in other areas including decision-making and sentiment analysis
<xref ref-type="bibr" rid="b24">24</xref>
<xref ref-type="bibr" rid="b38">38</xref>
.</p>
<p>TF-IDF is widely used in text mining and information retrieval because it allows the identification of terms that are characteristic of (and hence important for) one text or a set of texts. It is not sufficient for a term to be frequent in a text (TF); it must also be rare in other texts in the corpus (IDF). Importantly, IDF depends only on the occurrence of terms, not on their numerical frequencies. Drawing on analysis of documents in three independent domains, Salton and Yang
<xref ref-type="bibr" rid="b39">39</xref>
identified five situations relevant to the performance of TF-IDF:</p>
<p>
<list id="l1" list-type="order">
<list-item>
<p>Terms that appear frequently across a corpus contribute little to performance because they do not discriminate between relevant and non-relevant documents;</p>
</list-item>
<list-item>
<p>Terms that appear in a moderate number of texts and show somewhat skewed distributions provide good retrieval performance;</p>
</list-item>
<list-item>
<p>Terms with sharply skewed distribution occurring in very few documents are important only for those documents;</p>
</list-item>
<list-item>
<p>Rare terms are important for the few queries and documents in which they occur; and</p>
</list-item>
<list-item>
<p>Terms of low or moderate frequency, but with a flat distribution across documents, are similarly useful for the documents in which they occur.</p>
</list-item>
</list>
</p>
<p>Classically, the frequencies of terms in a corpus follow a power law (Zipf law), in which case TF-IDF performs well. However, TF-IDF can perform adequately even when this is not the case: TF-IDF requires only that terms relevant to the query are distributed intensively in a subset of documents within the corpus
<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b40">40</xref>
; this might include the query terms themselves (
<italic>e.g. happy</italic>
), or related terms in the corpus (
<italic>pleased</italic>
,
<italic>delighted</italic>
).</p>
</sec>
<sec disp-level="2">
<title>TF-IDF on sequences</title>
<p>Molecular sequences have long been analogised with natural language
<xref ref-type="bibr" rid="b41">41</xref>
or treated as texts
<xref ref-type="bibr" rid="b42">42</xref>
. Alternatively, both molecular sequences and texts have been subsumed within a broader class of objects
<xref ref-type="bibr" rid="b43">43</xref>
. The analogy is not precise: in sequences, “terms” must be recognized computationally,
<italic>e.g.</italic>
by extracting
<italic>k</italic>
-mers. Fast approaches exist for extracting
<italic>k</italic>
-mers
<xref ref-type="bibr" rid="b44">44</xref>
<xref ref-type="bibr" rid="b45">45</xref>
, and
<italic>k</italic>
-mer distribution in empirical sequences has been studied at some length
<xref ref-type="bibr" rid="b46">46</xref>
<xref ref-type="bibr" rid="b47">47</xref>
<xref ref-type="bibr" rid="b48">48</xref>
. Like words in text, short
<italic>k</italic>
-mers (
<italic>k</italic>
between three and eight) in DNA sequences show Zipf-like scaling
<xref ref-type="bibr" rid="b49">49</xref>
, although this is not sufficient to confirm DNA sequences as a natural language
<xref ref-type="bibr" rid="b50">50</xref>
.</p>
<p>Although there is dispute whether DNA is a language or not, some methods in text mining have been successfully applied to DNA analysis. For example, the first (to our knowledge) software to identify lateral transfer in biological datasets
<xref ref-type="bibr" rid="b51">51</xref>
was repurposed from the analysis of textual contamination in manuscripts, which in turn was built on software for phylogenetic inference from DNA sequences
<xref ref-type="bibr" rid="b52">52</xref>
(PHYLIP).</p>
<p>Sequences (genomes, genes, proteins) do, however, differ from texts in some properties. For example,
<italic>k</italic>
-mer frequency distributions in sequences are usually much flatter than term frequencies in texts. Experience from text mining indicates that this is not critical, but this remains to be explored and is in fact a goal of the current work. In the specific application here, genomic regions of lateral origin are expected to have
<italic>k</italic>
-mers that appear frequently in genomes of the donor taxon, but rarely in the host. This is analogous to conditions 2 and/or 4 above
<xref ref-type="bibr" rid="b39">39</xref>
.</p>
<p>Our algorithm works by comparing the frequencies of identified
<italic>k</italic>
-mers in a group of sequences (our TF) with their frequencies in other groups (our IDF). If a
<italic>k</italic>
-mer in one sequence is prevalent in a different group but not in its own, then it may have arisen by LGT from the group in which it is prevalent, and the direction of the transfer should be from that (donor) group to the recipient sequence. We compare these TF and IDF statistics to appropriate thresholds to optimize detection performance,
<italic>i.e.</italic>
to balance precision and recall.</p>
<p>Our algorithm consists of four steps: extracting all
<italic>k</italic>
-mers from genomes within one dataset calculating inverse document frequencies, constructing potential LGT segments, and calculating term frequencies. For pseudocode, see the
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
.</p>
</sec>
<sec disp-level="2">
<title>Extracting
<italic>k</italic>
-mers</title>
<p>To extract
<italic>k</italic>
-mers we scanned all the genomes, incrementing one nucleotide at one time. If the genome length is
<italic>L</italic>
, then
<italic>L</italic>
-
<italic>k</italic>
 + 1
<italic>k</italic>
-mers are found. Unique
<italic>k</italic>
-mers were indexed in a red-black tree
<xref ref-type="bibr" rid="b53">53</xref>
for further searching.</p>
</sec>
<sec disp-level="2">
<title>Calculating IDF</title>
<p>To calculate the inverse document frequency, we construct an
<italic>n</italic>
 × 
<italic>m</italic>
relationship matrix
<italic>R</italic>
, denoting the frequency (number of occurrences) at which
<italic>k</italic>
-mers in each sequence appear in each group. Each row in
<italic>R</italic>
corresponds to a sequence, and each column corresponds to a group. Suppose sequence
<italic>i</italic>
consists of
<italic>k</italic>
-mers
<italic>w</italic>
<sub>
<italic>i</italic>
,1</sub>
,
<italic>w</italic>
<sub>
<italic>i</italic>
,2</sub>
, …,
<italic>w</italic>
<sub>
<italic>i,L-k+1</italic>
</sub>
. If the word
<italic>w</italic>
appears in group
<italic>j</italic>
with frequency
<italic>f</italic>
<sub>
<italic>j</italic>
</sub>
<italic>(w)</italic>
, then the entries of the relationship matrix are</p>
<p>
<disp-formula id="eq1">
<inline-graphic id="d33e876" xlink:href="srep30308-m1.jpg"></inline-graphic>
</disp-formula>
</p>
<p>The entries in
<italic>R</italic>
are our IDF values. The larger the
<italic>R</italic>
<sub>
<italic>ij</italic>
</sub>
, the more likely that sequence
<italic>i</italic>
contains a region transferred laterally from group
<italic>j</italic>
. Note that this is in contrast to the original definition of IDF, where a higher IDF indicates that the word appears less frequently in other documents.</p>
<p>To detect potential lateral-transfer events, we compare the IDF values against a threshold
<italic>t</italic>
. This threshold is the average value of all entries in
<italic>R</italic>
:</p>
<p>
<disp-formula id="eq2">
<inline-graphic id="d33e904" xlink:href="srep30308-m2.jpg"></inline-graphic>
</disp-formula>
</p>
<p>IDF values that are above the average are used for further analysis.</p>
</sec>
<sec disp-level="2">
<title>Constructing potential LGT segments</title>
<p>We then mark potential lateral segments in each sequence. For each sequence
<italic>i</italic>
and group
<italic>j</italic>
with a sufficiently high IDF value, we examine each
<italic>k</italic>
-mer in sequence
<italic>i</italic>
to see if it appears in group
<italic>j</italic>
. Then we join all consecutive
<italic>k</italic>
-mers which do, forming potential lateral segments. Because mutations or other genomic events may disrupt the perfect matching, we allow gaps between blocks of
<italic>k</italic>
-mers of size up to a threshold
<italic>G</italic>
. Here we set
<italic>G</italic>
 = 2
<italic>k</italic>
, a value at which the total number of detections is not greatly affected in real application
<xref ref-type="bibr" rid="b28">28</xref>
. We then assess the significance of these potential lateral segments using term frequency.</p>
</sec>
<sec disp-level="2">
<title>Calculating TF</title>
<p>For each potential lateral segment
<italic>σ</italic>
in a sequence, we calculate the frequency (number of occurrences) at which each of its component
<italic>k</italic>
-mers appears in sequences of its own group, say
<italic>j</italic>
. Our TF statistic for
<italic>σ</italic>
is the sum of these:</p>
<p>
<disp-formula id="eq3">
<inline-graphic id="d33e964" xlink:href="srep30308-m3.jpg"></inline-graphic>
</disp-formula>
</p>
<p>If
<italic>δ</italic>
<sub>
<italic>σ</italic>
</sub>
is higher than some threshold, then
<italic>σ</italic>
occurs frequently in its own group, and as such is considered not to be the consequence of a lateral event; otherwise it is considered to be of lateral origin.</p>
<p>To set the threshold, we calculate the average frequency of all unique
<italic>k</italic>
-mers in the recipient group
<italic>j</italic>
, denoted by
<italic>τ</italic>
<sub>
<italic>j</italic>
</sub>
. Then we compare
<italic>δ</italic>
<sub>
<italic>σ</italic>
</sub>
to
<italic></italic>
<sub>
<italic>j</italic>
</sub>
, where
<italic>l</italic>
is the number of
<italic>k</italic>
-mers contained in the segment. If
<italic>δ</italic>
<sub>
<italic>σ</italic>
</sub>
is smaller, we consider
<italic>σ</italic>
to have been transferred laterally from the other group into this sequence. Other approaches to setting the threshold are possible, but we do not consider them here.</p>
<p>Note that our method considers lateral transfers only within the dataset; like most other LGT methods, it is silent on potential transfers from sources external to the dataset. In addition, it can detect transfers only between groups, not between sequences in the same group.</p>
</sec>
<sec disp-level="2">
<title>Runtime analysis</title>
<p>The computational complexity of the algorithm is dominated by extraction of the unique
<italic>k</italic>
-mers in the dataset. To find these, we scan each of the
<italic>n</italic>
sequences of length
<italic>L</italic>
. As each unique
<italic>k</italic>
-mer is found it is added to a library, which is stored in a red-black tree
<xref ref-type="bibr" rid="b53">53</xref>
. A red-black tree is an approximately balanced tree, which guarantees that searching and insertion are efficient. On average, this step takes
<italic>O</italic>
(
<italic>nL log U</italic>
) time, where
<italic>U</italic>
is the number of unique
<italic>k</italic>
-mers stored in the tree. The frequency of each
<italic>k</italic>
-mer is also computed at this time. The remaining calculations are much quicker because most of the frequency (
<italic>f)</italic>
terms are zero. Thus for biological sequences of standard complexity, runtime increases about log-linearly with sequence length. Note that the
<italic>k</italic>
-mer profiles of each sequence could in principle be stored and retrieved for future use.</p>
</sec>
<sec disp-level="2">
<title>Implementation</title>
<p>We have implemented this algorithm in C++. The program can be compiled using GCC 4.8.2 and run on Unix, Unix-like and Windows platforms. We use the
<italic>map</italic>
template from STL (Standard Template Library) to index all distinct
<italic>k</italic>
-mers in a dataset. The inner implementation of
<italic>map</italic>
is a red-black tree
<xref ref-type="bibr" rid="b53">53</xref>
.</p>
</sec>
<sec disp-level="2">
<title>Comparisons with ALFY</title>
<p>ALFY finds putative homology (shared DNA segments) between pairs of sequences by matching shustrings (shortest unique substrings). If a match is found with a region in an otherwise distant sequence, it will be judged as a potential lateral transfer. This method shows high efficiency and effectiveness for LGT detection
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b16">16</xref>
, so we use it to benchmark our method.</p>
<p>The inputs to both TF-IDF and ALFY are sequences. For TF-IDF the group information is compulsory, while ALFY requires a query sequence and subject sequences. Both TF-IDF and ALFY can process DNA sequences; TF-IDF can also process amino-acid sequences, but ALFY does not currently implement evolutionary models of amino-acid change. Only
<italic>k</italic>
-mer frequencies will be taken into consideration for calculating the value of TF-IDF.</p>
<p>In TF-IDF, if a
<italic>k</italic>
-mer has low a frequency in its own group but high frequencies in other groups, then this
<italic>k</italic>
-mer will be judged atypical. A set of contiguous atypical
<italic>k</italic>
-mers will be inferred as lateral, with the direction of the transfer from the
<italic>k</italic>
-mer prevalent group. In contrast, ALFY computes the average shustring length between segments of only two sequences at a time. The longer the average shustring, the closer the two segments; and if the sequences themselves are otherwise distant in the reference tree, the segment in question will be inferred as lateral, without any implication of which sequence was donor or recipient.</p>
<p>If the sequences are grouped such that each group is compact and boundaries between groups are clear, then TF-IDF should find lateral segments easily. ALFY does not use group information, so grouping does not affect its performance.</p>
<p>The computational complexity of TF-IDF is
<italic>O</italic>
(
<italic>nL log nL</italic>
), where
<italic>n</italic>
is total number of sequences in a dataset, and
<italic>L</italic>
the average length of sequences in a dataset. The computational complexity of the ALFY algorithm is
<italic>O</italic>
(
<italic>nL</italic>
). However, TF-IDF will process all sequences and infer all potential lateral regions over an entire dataset, whereas ALFY makes all pairwise comparisons between a single query sequence and the others. For fairness of comparison, all sequences in a dataset should be set as queries to find all LGTs in a dataset, in which case the complexity of ALFY increases to
<italic>O</italic>
(
<italic>n</italic>
<sup>2</sup>
<italic>L</italic>
), which in practice is much slower than TF-IDF.</p>
</sec>
<sec disp-level="2">
<title>Simulation of datasets</title>
<p>In order to test the performance of TF-IDF in different situations, and to compare with ALFY, we simulated datasets under the HYK85
<xref ref-type="bibr" rid="b26">26</xref>
and F84
<xref ref-type="bibr" rid="b27">27</xref>
evolutionary models. Our simulation process is as follows:</p>
<p>
<list id="l2" list-type="order">
<list-item>
<p>We start with one random sequence, which will become the ancestor of all sequences in the dataset. We vary the length
<italic>L</italic>
of this sequence from 1000 to 300000 characters to simulate sequences from a single gene to a significant part of a genome (but our algorithm can be applied to sequences of any length).</p>
</list-item>
<list-item>
<p>To establish phyletic groups (
<italic>i.e.</italic>
to simulate speciation), the ancestral sequence is allowed to evolve along a balanced binary tree with four levels of equal branch lengths, using the evolutionary model. The branch length varies from 0.01 to 0.20 (substitutions per site) in steps of 0.05. We refer to this parameter as
<italic>variation between groups</italic>
.</p>
</list-item>
<list-item>
<p>To populate these groups with sequences, each descendant (leaf) in the initial tree (above) is allowed to evolve along another phylogenetic tree under the same evolution model. Again we use a balanced binary tree with four levels of equal branch length, which vary from 0.001 to 0.020 in steps of 0.005. We refer to this parameter as
<italic>variation within groups</italic>
.</p>
</list-item>
<list-item>
<p>We then simulate LGT events between groups. For the sake of simplicity, here we make transfers only into sequences in Group 1. We fix the number of LGT events at 20, with lengths normally distributed around mean 0.1 of the sequence length, and standard deviation half that amount. For each simulated event the recipient sequence (in Group 1) is selected at random, with (typically) several sequences receiving multiple transfers and others receiving none. Transfer events overwrite the equivalent positions in the recipient sequence, but (to simplify our simulation) cannot themselves be subsequently overwritten. Five of the 20 LGT events are simulated to come from the group (of 16 sequences) arising from the most-recent common ancestor on the binary tree (from Step 2), five from descendants of the second-most recent ancestor (32 sequences), five from the third (64 sequences) and five from the deepest bifurcation (128 sequences). Thus the probability of transfer decreases with increasing distance (on the tree) between donor and target.</p>
</list-item>
<list-item>
<p>In a final evolutionary process, we further evolve each of the 256 sequences along a balanced two-level tree, with branch lengths varying from 0 to 0.1 in steps of 0.025. We refer to this parameter as
<italic>variation post-LGT.</italic>
</p>
</list-item>
<list-item>
<p>In some simulations, Step 5 also includes a stochastic process (implemented by using a shell script to call ALF
<xref ref-type="bibr" rid="b54">54</xref>
, not to be confused with ALFY) which deletes from 0 to 10% of a sequence. The proportion was varied using the
<italic>deletion rate</italic>
setting in ALF, while keeping
<italic>deletion length distribution</italic>
at its default value. We refer to this parameter as
<italic>deletion</italic>
. We did not simulate duplications here because bacterial genomes contain very few repetitive components.</p>
</list-item>
</list>
</p>
<p>After the above steps, we select one descendant of each tree to yield our final dataset (256 sequences per simulation).</p>
<p>In addition to varying the parameters mentioned above for both TF-IDF and ALFY, for TF-IDF only we also varied the word length
<italic>k</italic>
, in steps of 10 from 20 to 50. As the number of possible parameter combinations above is very large, at Step 2 we varied only the
<italic>variation between groups</italic>
parameter while keeping all others fixed at minimal-impact settings. Similarly at Step 3 we varied only the
<italic>variation within groups</italic>
parameter. For each parameter combination we simulated 50 datasets under the F84 model of sequence change, and 50 under HYK85. This process is depicted in
<xref ref-type="fig" rid="f10">Fig. 10</xref>
, and is explained in greater detail in the
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
. We also analysed smaller datasets omitting Step 4, to examine whether TF-IDF inferred LGT when none was present; no segments met the IDF (
<italic>k</italic>
-mers frequent in donor groups) and TF (
<italic>k</italic>
-mers infrequent in the recipient group) criteria simultaneously, so no LGT was inferred.</p>
</sec>
<sec disp-level="2">
<title>Performance measures</title>
<p>We assessed the performance of our algorithm on simulated data using two measures. Precision is the proportion of inferred LGT events which are real (
<italic>i.e.</italic>
were actually simulated):</p>
<p>
<disp-formula id="eq4">
<inline-graphic id="d33e1222" xlink:href="srep30308-m4.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<italic>tp</italic>
and
<italic>fp</italic>
are the total lengths of all true and false positives respectively. Recall is the proportion of true (simulated) LGTs which were inferred by the algorithm:</p>
<p>
<disp-formula id="eq5">
<inline-graphic id="d33e1233" xlink:href="srep30308-m5.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<italic>fn</italic>
is the total length of false negatives (simulated LGTs which were not found).</p>
<p>
<xref ref-type="fig" rid="f11">Figure 11</xref>
illustrates the output of TF-IDF analysis of a simulated dataset, showing the 20 regions of (simulated) lateral origin of which 19 were detected (wholly or in part) by TF-IDF. Positions 797–877 of Sequence 11 represent a false positive inference of LGT, and positions 58–117 of Sequence 2 a false negative. Overall for this dataset (
<italic>i.e.</italic>
LGT from Groups 2–16 into Group 1), precision was 0.82 and recall 0.95. Complete details (start and end coordinates) are presented in the
<xref ref-type="supplementary-material" rid="S1">Supplementary file</xref>
.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Additional Information</title>
<p>
<bold>How to cite this article</bold>
: Cong, Y.
<italic>et al</italic>
. A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF.
<italic>Sci. Rep.</italic>
<bold>6</bold>
, 30308; doi: 10.1038/srep30308 (2016).</p>
</sec>
<sec sec-type="supplementary-material" id="S1">
<title>Supplementary Material</title>
<supplementary-material id="d33e42" content-type="local-data">
<caption>
<title>Supplementary Information</title>
</caption>
<media xlink:href="srep30308-s1.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>Simulations requiring large-memory hardware were performed on QRIScloud (Queensland Cyber Infrastructure Foundation). YC acknowledges the China Scholarship Council and The University of Queensland for stipend support. The project was funded in part by a grant from the James S. McDonnell Foundation to MAR.</p>
</ack>
<ref-list>
<ref id="b1">
<mixed-citation publication-type="journal">
<name>
<surname>Ochman</surname>
<given-names>H.</given-names>
</name>
,
<name>
<surname>Lawrence</surname>
<given-names>J. G.</given-names>
</name>
&
<name>
<surname>Groisman</surname>
<given-names>E. A.</given-names>
</name>
<article-title>Lateral gene transfer and the nature of bacterial innovation</article-title>
.
<source>Nature</source>
<volume>405</volume>
,
<fpage>299</fpage>
<lpage>304</lpage>
,
<pub-id pub-id-type="doi">10.1038/35012500</pub-id>
(
<year>2000</year>
).
<pub-id pub-id-type="pmid">10830951</pub-id>
</mixed-citation>
</ref>
<ref id="b2">
<mixed-citation publication-type="journal">
<name>
<surname>Schmitt</surname>
<given-names>R. M.</given-names>
</name>
<article-title>Zur Variablilität der Enteritis-bakterien</article-title>
.
<source>Zeitschr Infektionskrankh parasit Krankh Hyg Haustiere</source>
<volume>9</volume>
,
<fpage>188</fpage>
(
<year>1911</year>
).</mixed-citation>
</ref>
<ref id="b3">
<mixed-citation publication-type="journal">
<name>
<surname>Davies</surname>
<given-names>J.</given-names>
</name>
<article-title>Origins and evolution of antibiotic resistance</article-title>
.
<source>Microbiologia</source>
<volume>12</volume>
,
<fpage>9</fpage>
<lpage>16</lpage>
(
<year>1996</year>
).
<pub-id pub-id-type="pmid">9019139</pub-id>
</mixed-citation>
</ref>
<ref id="b4">
<mixed-citation publication-type="journal">
<name>
<surname>Doolittle</surname>
<given-names>W. F.</given-names>
</name>
<article-title>Phylogenetic classification and the universal tree</article-title>
.
<source>Science</source>
<volume>284</volume>
,
<fpage>2124</fpage>
<lpage>2129</lpage>
(
<year>1999</year>
).
<pub-id pub-id-type="pmid">10381871</pub-id>
</mixed-citation>
</ref>
<ref id="b5">
<mixed-citation publication-type="journal">
<name>
<surname>Martin</surname>
<given-names>W.</given-names>
</name>
<article-title>Mosaic bacterial chromosomes: a challenge on route to a tree of genomes</article-title>
.
<source>Bioessays</source>
<volume>21</volume>
,
<fpage>99</fpage>
<lpage>104</lpage>
,
<pub-id pub-id-type="doi">10.1002/(Sici)1521-1878(199902)21:2<99::Aid-Bies3>3.0.Co;2-B</pub-id>
(
<year>1999</year>
).
<pub-id pub-id-type="pmid">10193183</pub-id>
</mixed-citation>
</ref>
<ref id="b6">
<mixed-citation publication-type="journal">
<name>
<surname>Beiko</surname>
<given-names>R. G.</given-names>
</name>
,
<name>
<surname>Harlow</surname>
<given-names>T. J.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Highways of gene sharing in prokaryotes</article-title>
.
<source>Proc. Natl Acad. Sci. USA</source>
<volume>102</volume>
,
<fpage>14332</fpage>
<lpage>14337</lpage>
,
<pub-id pub-id-type="doi">10.1073/pnas.0504068102</pub-id>
(
<year>2005</year>
).
<pub-id pub-id-type="pmid">16176988</pub-id>
</mixed-citation>
</ref>
<ref id="b7">
<mixed-citation publication-type="journal">
<name>
<surname>Raymond</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Siefert</surname>
<given-names>J. L.</given-names>
</name>
,
<name>
<surname>Staples</surname>
<given-names>C. R.</given-names>
</name>
&
<name>
<surname>Blankenship</surname>
<given-names>R. E.</given-names>
</name>
<article-title>The natural history of nitrogen fixation</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>21</volume>
,
<fpage>541</fpage>
<lpage>554</lpage>
,
<pub-id pub-id-type="doi">10.1093/molbev/msh047</pub-id>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">14694078</pub-id>
</mixed-citation>
</ref>
<ref id="b8">
<mixed-citation publication-type="journal">
<name>
<surname>Thomas</surname>
<given-names>C. M.</given-names>
</name>
&
<name>
<surname>Nielsen</surname>
<given-names>K. M.</given-names>
</name>
<article-title>Mechanisms of, and barriers to, horizontal gene transfer between bacteria</article-title>
.
<source>Nat. Rev. Microbiol.</source>
<volume>3</volume>
,
<fpage>711</fpage>
<lpage>721</lpage>
(
<year>2005</year>
).
<pub-id pub-id-type="pmid">16138099</pub-id>
</mixed-citation>
</ref>
<ref id="b9">
<mixed-citation publication-type="journal">
<name>
<surname>Skippington</surname>
<given-names>E.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Lateral genetic transfer and the construction of genetic exchange communities</article-title>
.
<source>FEMS Microbiol. Rev.</source>
<volume>35</volume>
,
<fpage>707</fpage>
<lpage>735</lpage>
,
<pub-id pub-id-type="doi">10.1111/j.1574-6976.2010.00261.x</pub-id>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21223321</pub-id>
</mixed-citation>
</ref>
<ref id="b10">
<mixed-citation publication-type="journal">
<name>
<surname>Chan</surname>
<given-names>C. X.</given-names>
</name>
,
<name>
<surname>Darling</surname>
<given-names>A. E.</given-names>
</name>
,
<name>
<surname>Beiko</surname>
<given-names>R. G.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Are protein domains modules of lateral genetic transfer?</article-title>
<source>PLoS ONE.</source>
<volume>4</volume>
,
<fpage>e4524</fpage>
,
<pub-id pub-id-type="doi">10.1371/journal.pone.0004524</pub-id>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19229333</pub-id>
</mixed-citation>
</ref>
<ref id="b11">
<mixed-citation publication-type="journal">
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
&
<name>
<surname>Beiko</surname>
<given-names>R. G.</given-names>
</name>
<article-title>Lateral genetic transfer: open issues</article-title>
.
<source>Philos. Trans. R. Soc. Lond. B. Biol. Sci.</source>
<volume>364</volume>
,
<fpage>2241</fpage>
<lpage>2251</lpage>
,
<pub-id pub-id-type="doi">10.1098/rstb.2009.0031</pub-id>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19571244</pub-id>
</mixed-citation>
</ref>
<ref id="b12">
<mixed-citation publication-type="journal">
<name>
<surname>Lawrence</surname>
<given-names>J. G.</given-names>
</name>
&
<name>
<surname>Ochman</surname>
<given-names>H.</given-names>
</name>
<article-title>Amelioration of bacterial genomes: Rates of change and exchange</article-title>
.
<source>J. Mol. Evol.</source>
<volume>44</volume>
,
<fpage>383</fpage>
<lpage>397</lpage>
,
<pub-id pub-id-type="doi">10.1007/Pl00006158</pub-id>
(
<year>1997</year>
).
<pub-id pub-id-type="pmid">9089078</pub-id>
</mixed-citation>
</ref>
<ref id="b13">
<mixed-citation publication-type="journal">
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>On surrogate methods for detecting lateral gene transfer</article-title>
.
<source>FEMS Microbiol. Lett.</source>
<volume>201</volume>
,
<fpage>187</fpage>
<lpage>191</lpage>
,
<pub-id pub-id-type="doi">10.1111/J.1574-6968.2001.Tb10755.X</pub-id>
(
<year>2001</year>
).
<pub-id pub-id-type="pmid">11470360</pub-id>
</mixed-citation>
</ref>
<ref id="b14">
<mixed-citation publication-type="journal">
<name>
<surname>Lawrence</surname>
<given-names>J. G.</given-names>
</name>
&
<name>
<surname>Ochman</surname>
<given-names>H.</given-names>
</name>
<article-title>Reconciling the many faces of lateral gene transfer</article-title>
.
<source>Trends. Microbiol.</source>
<volume>10</volume>
,
<fpage>1</fpage>
<lpage>4</lpage>
,
<pub-id pub-id-type="doi">10.1016/S0966-842x(01)02282-X</pub-id>
(
<year>2002</year>
).
<pub-id pub-id-type="pmid">11755071</pub-id>
</mixed-citation>
</ref>
<ref id="b15">
<mixed-citation publication-type="journal">
<name>
<surname>Domazet-Lošo</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
<article-title>Alignment-free detection of horizontal gene transfer between closely related bacterial genomes</article-title>
.
<source>Mob. Genet. Elements</source>
<volume>1</volume>
,
<fpage>230</fpage>
<lpage>235</lpage>
,
<pub-id pub-id-type="doi">10.4161/mge.1.3.18065</pub-id>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">22312592</pub-id>
</mixed-citation>
</ref>
<ref id="b16">
<mixed-citation publication-type="journal">
<name>
<surname>Domazet-Lošo</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
<article-title>Alignment-free detection of local similarity among viral and bacterial genomes</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
,
<fpage>1466</fpage>
<lpage>1472</lpage>
,
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr176</pub-id>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21471011</pub-id>
</mixed-citation>
</ref>
<ref id="b17">
<mixed-citation publication-type="journal">
<name>
<surname>Domazet-Lošo</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Haubold</surname>
<given-names>B.</given-names>
</name>
<article-title>Efficient estimation of pairwise distances between genomes</article-title>
.
<source>Bioinformatics</source>
<volume>25</volume>
,
<fpage>3221</fpage>
<lpage>3227</lpage>
,
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp590</pub-id>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19825795</pub-id>
</mixed-citation>
</ref>
<ref id="b18">
<mixed-citation publication-type="journal">
<name>
<surname>Saitou</surname>
<given-names>N.</given-names>
</name>
&
<name>
<surname>Nei</surname>
<given-names>M.</given-names>
</name>
<article-title>The neighbor-joining method - a new method for reconstructing phylogenetic trees</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>4</volume>
,
<fpage>406</fpage>
<lpage>425</lpage>
(
<year>1987</year>
).
<pub-id pub-id-type="pmid">3447015</pub-id>
</mixed-citation>
</ref>
<ref id="b19">
<mixed-citation publication-type="journal">
<name>
<surname>Taniguchi</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Yamada</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Maruyama</surname>
<given-names>O.</given-names>
</name>
,
<name>
<surname>Kuhara</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Ikeda</surname>
<given-names>D.</given-names>
</name>
<article-title>The purity measure for genomic regions leads to horizontally transferred genes</article-title>
.
<source>J. Bioinf. Comput. Biol.</source>
<volume>11</volume>
,
<fpage>1343002</fpage>
, doi: Artn 1343002
<pub-id pub-id-type="doi">10.1142/S0219720013430026</pub-id>
(
<year>2013</year>
).</mixed-citation>
</ref>
<ref id="b20">
<mixed-citation publication-type="other">
<name>
<surname>Gusfield</surname>
<given-names>D.</given-names>
</name>
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (Cambridge University Press,
<year>1997</year>
).</mixed-citation>
</ref>
<ref id="b21">
<mixed-citation publication-type="journal">
<name>
<surname>Luhn</surname>
<given-names>H. P.</given-names>
</name>
<article-title>The automatic creation of literature abstracts</article-title>
.
<source>IBM J. Res. Dev.</source>
<volume>2</volume>
,
<fpage>159</fpage>
<lpage>165</lpage>
(
<year>1958</year>
).</mixed-citation>
</ref>
<ref id="b22">
<mixed-citation publication-type="journal">
<name>
<surname>Jones</surname>
<given-names>K. S.</given-names>
</name>
<article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
.
<source>J. Docum</source>
<volume>28</volume>
,
<fpage>11</fpage>
<lpage>21</lpage>
(
<year>1972</year>
).</mixed-citation>
</ref>
<ref id="b23">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Buckley</surname>
<given-names>C.</given-names>
</name>
<article-title>Term-weighting approaches in automatic text retrieval</article-title>
.
<source>Inform. Process Manag.</source>
<volume>24</volume>
,
<fpage>513</fpage>
<lpage>523</lpage>
,
<pub-id pub-id-type="doi">10.1016/0306-4573(88)90021-0</pub-id>
(
<year>1988</year>
).</mixed-citation>
</ref>
<ref id="b24">
<mixed-citation publication-type="journal">
<name>
<surname>Wu</surname>
<given-names>H. C.</given-names>
</name>
,
<name>
<surname>Luk</surname>
<given-names>R. W. P.</given-names>
</name>
,
<name>
<surname>Wong</surname>
<given-names>K. F.</given-names>
</name>
&
<name>
<surname>Kwok</surname>
<given-names>K. L.</given-names>
</name>
<article-title>Interpreting TF-IDF term weights as making relevance decisions</article-title>
.
<source>ACM T. Inform. Syst.</source>
<volume>26</volume>
, doi: Artn
<pub-id pub-id-type="doi">10.1145/1361684.1361686</pub-id>
(
<year>2008</year>
).</mixed-citation>
</ref>
<ref id="b25">
<mixed-citation publication-type="journal">
<name>
<surname>Holden</surname>
<given-names>M. T.</given-names>
</name>
<etal></etal>
.
<article-title>Genome sequence of a recently emerged, highly transmissible, multi-antibiotic- and antiseptic-resistant variant of methicillin-resistant Staphylococcus aureus, sequence type 239 (TW)</article-title>
.
<source>J. Bacteriol.</source>
<volume>192</volume>
,
<fpage>888</fpage>
<lpage>892</lpage>
,
<pub-id pub-id-type="doi">10.1128/JB.01255-09</pub-id>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">19948800</pub-id>
</mixed-citation>
</ref>
<ref id="b26">
<mixed-citation publication-type="journal">
<name>
<surname>Hasegawa</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Kishino</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>Yano</surname>
<given-names>T.</given-names>
</name>
<article-title>Dating of the human-ape splitting by a molecular clock of mitochondrial DNA</article-title>
.
<source>J. Mol. Evol.</source>
<volume>22</volume>
,
<fpage>160</fpage>
<lpage>174</lpage>
(
<year>1985</year>
).
<pub-id pub-id-type="pmid">3934395</pub-id>
</mixed-citation>
</ref>
<ref id="b27">
<mixed-citation publication-type="journal">
<name>
<surname>Felsenstein</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Churchill</surname>
<given-names>G. A.</given-names>
</name>
<article-title>A hidden Markov model approach to variation among sites in rate of evolution</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>13</volume>
,
<fpage>93</fpage>
<lpage>104</lpage>
(
<year>1996</year>
).
<pub-id pub-id-type="pmid">8583911</pub-id>
</mixed-citation>
</ref>
<ref id="b28">
<mixed-citation publication-type="journal">
<name>
<surname>Cong</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Chan</surname>
<given-names>Y.-b.</given-names>
</name>
&
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Exploring lateral genetic transfer among microbial genomes using TF-IDF</article-title>
.
<source>Scientific Reports</source>
<volume>6</volume>
,
<fpage>29319</fpage>
(
<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b29">
<mixed-citation publication-type="journal">
<name>
<surname>Popa</surname>
<given-names>O.</given-names>
</name>
,
<name>
<surname>Hazkani-Covo</surname>
<given-names>E.</given-names>
</name>
,
<name>
<surname>Landan</surname>
<given-names>G.</given-names>
</name>
,
<name>
<surname>Martin</surname>
<given-names>W.</given-names>
</name>
&
<name>
<surname>Dagan</surname>
<given-names>T.</given-names>
</name>
<article-title>Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes</article-title>
.
<source>Genome Res.</source>
<volume>21</volume>
,
<fpage>599</fpage>
<lpage>609</lpage>
,
<pub-id pub-id-type="doi">10.1101/gr.115592.110</pub-id>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21270172</pub-id>
</mixed-citation>
</ref>
<ref id="b30">
<mixed-citation publication-type="journal">
<name>
<surname>Jain</surname>
<given-names>R.</given-names>
</name>
,
<name>
<surname>Rivera</surname>
<given-names>M. C.</given-names>
</name>
&
<name>
<surname>Lake</surname>
<given-names>J. A.</given-names>
</name>
<article-title>Horizontal gene transfer among genomes: The complexity hypothesis</article-title>
.
<source>Proc. Natl Acad. Sci. USA</source>
<volume>96</volume>
,
<fpage>3801</fpage>
<lpage>3806</lpage>
,
<pub-id pub-id-type="doi">10.1073/Pnas.96.7.3801</pub-id>
(
<year>1999</year>
).
<pub-id pub-id-type="pmid">10097118</pub-id>
</mixed-citation>
</ref>
<ref id="b31">
<mixed-citation publication-type="journal">
<name>
<surname>Robinson</surname>
<given-names>D. A.</given-names>
</name>
&
<name>
<surname>Enright</surname>
<given-names>M. C.</given-names>
</name>
<article-title>Evolution of Staphylococcus aureus by large chromosomal replacements</article-title>
.
<source>J. Bacteriol.</source>
<volume>186</volume>
,
<fpage>1060</fpage>
<lpage>1064</lpage>
,
<pub-id pub-id-type="doi">10.1128/Jb.186.4.1060-1064.2004</pub-id>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">14762000</pub-id>
</mixed-citation>
</ref>
<ref id="b32">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
<source>The SMART retrieval system; experiments in automatic document processing</source>
(Prentice-Hall,
<year>1971</year>
).</mixed-citation>
</ref>
<ref id="b33">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>McGill</surname>
<given-names>M. J.</given-names>
</name>
<source>Introduction to modern information retrieval</source>
(McGraw-Hill,
<year>1983</year>
).</mixed-citation>
</ref>
<ref id="b34">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>McGill</surname>
<given-names>M. J.</given-names>
</name>
<article-title>The SMART and SIRE experimental retrieval systems</article-title>
in
<source>Readings in information retrieval</source>
(eds
<name>
<surname>Sparck Jones</surname>
<given-names>K</given-names>
</name>
&
<name>
<surname>Willett</surname>
<given-names>P</given-names>
</name>
)
<fpage>381</fpage>
<lpage>399</lpage>
(Morgan Kaufmann Publishers Inc.,
<year>1997</year>
).</mixed-citation>
</ref>
<ref id="b35">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
<article-title>Developments in automatic text retrieval</article-title>
.
<source>Science</source>
<volume>253</volume>
,
<fpage>974</fpage>
<lpage>980</lpage>
,
<pub-id pub-id-type="doi">10.1126/Science.253.5023.974</pub-id>
(
<year>1991</year>
).
<pub-id pub-id-type="pmid">17775340</pub-id>
</mixed-citation>
</ref>
<ref id="b36">
<mixed-citation publication-type="other">
<name>
<surname>Joachims</surname>
<given-names>T.</given-names>
</name>
A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In
<italic>Proceedings of the Fourteenth International Conference on Machine Learning</italic>
,
<fpage>143</fpage>
<lpage>151</lpage>
(
<year>1996</year>
).</mixed-citation>
</ref>
<ref id="b37">
<mixed-citation publication-type="journal">
<name>
<surname>Zobel</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Moffat</surname>
<given-names>A.</given-names>
</name>
<article-title>Exploring the similarity space</article-title>
.
<source>SIGIR Forum</source>
<volume>32</volume>
,
<fpage>18</fpage>
<lpage>34</lpage>
(
<year>1998</year>
).</mixed-citation>
</ref>
<ref id="b38">
<mixed-citation publication-type="other">
<name>
<surname>Paltoglou</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Thelwall</surname>
<given-names>M.</given-names>
</name>
in
<italic>Proc. of the 48th Annual Meeting of the Association for Computational Linguistics</italic>
1386–1395 (Association for Computational Linguistics, Uppsala, Sweden,
<year>2010</year>
).</mixed-citation>
</ref>
<ref id="b39">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Yang</surname>
<given-names>C.-S.</given-names>
</name>
<article-title>On the specification of term values in automatic indexing</article-title>
.
<source>J. Docum</source>
<volume>29</volume>
,
<fpage>351</fpage>
<lpage>372</lpage>
(
<year>1973</year>
).</mixed-citation>
</ref>
<ref id="b40">
<mixed-citation publication-type="journal">
<name>
<surname>Salton</surname>
<given-names>G.</given-names>
</name>
,
<name>
<surname>Yang</surname>
<given-names>C.-S.</given-names>
</name>
&
<name>
<surname>Yu</surname>
<given-names>C. T.</given-names>
</name>
<article-title>A theory of term importance in automatic text analysis</article-title>
.
<source>J. Am. Soc. Inf. Sci.</source>
<volume>26</volume>
,
<fpage>33</fpage>
<lpage>44</lpage>
(
<year>1975</year>
).</mixed-citation>
</ref>
<ref id="b41">
<mixed-citation publication-type="journal">
<name>
<surname>Nussinov</surname>
<given-names>R.</given-names>
</name>
<article-title>Some rules in the ordering of nucleotides in the DNA</article-title>
.
<source>Nucleic Acids Res.</source>
<volume>8</volume>
,
<fpage>4545</fpage>
<lpage>4562</lpage>
(
<year>1980</year>
).
<pub-id pub-id-type="pmid">7433114</pub-id>
</mixed-citation>
</ref>
<ref id="b42">
<mixed-citation publication-type="journal">
<name>
<surname>Koonin</surname>
<given-names>E. V.</given-names>
</name>
&
<name>
<surname>Galperin</surname>
<given-names>M. Y.</given-names>
</name>
In
<source>Sequence - Evolution - Function: Computational Approaches in Comparative Genomics</source>
(Kluwe Academic,
<year>2003</year>
).</mixed-citation>
</ref>
<ref id="b43">
<mixed-citation publication-type="journal">
<name>
<surname>Kruskal</surname>
<given-names>J. B.</given-names>
</name>
<article-title>An overview of sequence comparison - time warps, string edits, and macromolecules</article-title>
.
<source>S.I.A.M Rev.</source>
<volume>25</volume>
,
<fpage>201</fpage>
<lpage>237</lpage>
,
<pub-id pub-id-type="doi">10.1137/1025045</pub-id>
(
<year>1983</year>
).</mixed-citation>
</ref>
<ref id="b44">
<mixed-citation publication-type="journal">
<name>
<surname>Marçais</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Kingsford</surname>
<given-names>C.</given-names>
</name>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
.
<source>Bioinformatics</source>
<volume>27</volume>
,
<fpage>764</fpage>
<lpage>770</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21217122</pub-id>
</mixed-citation>
</ref>
<ref id="b45">
<mixed-citation publication-type="journal">
<name>
<surname>Greenfield</surname>
<given-names>P.</given-names>
</name>
,
<name>
<surname>Duesing</surname>
<given-names>K.</given-names>
</name>
,
<name>
<surname>Papanicolaou</surname>
<given-names>A.</given-names>
</name>
&
<name>
<surname>Bauer</surname>
<given-names>D. C.</given-names>
</name>
<article-title>Blue: correcting sequencing errors using consensus and context</article-title>
.
<source>Bioinformatics</source>
<volume>30</volume>
,
<fpage>2723</fpage>
<lpage>2732</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24919879</pub-id>
</mixed-citation>
</ref>
<ref id="b46">
<mixed-citation publication-type="journal">
<name>
<surname>Chor</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Horn</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Goldman</surname>
<given-names>N.</given-names>
</name>
,
<name>
<surname>Levy</surname>
<given-names>Y.</given-names>
</name>
&
<name>
<surname>Massingham</surname>
<given-names>T.</given-names>
</name>
<article-title>Genomic DNA k-mer spectra: models and modalities</article-title>
.
<source>Genome. Biol</source>
<volume>10</volume>
,
<fpage>R108</fpage>
,
<pub-id pub-id-type="doi">10.1186/gb-2009-10-10-r108</pub-id>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19814784</pub-id>
</mixed-citation>
</ref>
<ref id="b47">
<mixed-citation publication-type="journal">
<name>
<surname>Burden</surname>
<given-names>C. J.</given-names>
</name>
,
<name>
<surname>Leopardi</surname>
<given-names>P.</given-names>
</name>
&
<name>
<surname>Foret</surname>
<given-names>S.</given-names>
</name>
<article-title>The distribution of word matches between Markovian sequences with periodic boundary conditions</article-title>
.
<source>J. Comput. Biol.</source>
<volume>21</volume>
,
<fpage>41</fpage>
<lpage>63</lpage>
,
<pub-id pub-id-type="doi">10.1089/Cmb.2012.0277</pub-id>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24160839</pub-id>
</mixed-citation>
</ref>
<ref id="b48">
<mixed-citation publication-type="journal">
<name>
<surname>Kurtz</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Narechania</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Stein</surname>
<given-names>J. C.</given-names>
</name>
&
<name>
<surname>Ware</surname>
<given-names>D.</given-names>
</name>
<article-title>A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes</article-title>
.
<source>BMC Genomics</source>
<volume>9</volume>
,
<fpage>517</fpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18976482</pub-id>
</mixed-citation>
</ref>
<ref id="b49">
<mixed-citation publication-type="journal">
<name>
<surname>Mantegna</surname>
<given-names>R. N.</given-names>
</name>
<etal></etal>
.
<article-title>Linguistic features of noncoding DNA-sequences</article-title>
.
<source>Phys. Rev. Lett.</source>
<volume>73</volume>
,
<fpage>3169</fpage>
<lpage>3172</lpage>
,
<pub-id pub-id-type="doi">10.1103/Physrevlett.73.3169</pub-id>
(
<year>1994</year>
).
<pub-id pub-id-type="pmid">10057305</pub-id>
</mixed-citation>
</ref>
<ref id="b50">
<mixed-citation publication-type="journal">
<name>
<surname>Tsonis</surname>
<given-names>A. A.</given-names>
</name>
,
<name>
<surname>Elsner</surname>
<given-names>J. B.</given-names>
</name>
&
<name>
<surname>Tsonis</surname>
<given-names>P. A.</given-names>
</name>
<article-title>Is DNA a language?</article-title>
<source>J. Theor. Biol.</source>
<volume>184</volume>
,
<fpage>25</fpage>
<lpage>29</lpage>
,
<pub-id pub-id-type="doi">10.1006/Jtbi.1996.0239</pub-id>
(
<year>1997</year>
).
<pub-id pub-id-type="pmid">9039397</pub-id>
</mixed-citation>
</ref>
<ref id="b51">
<mixed-citation publication-type="journal">
<name>
<surname>Ragan</surname>
<given-names>M. A.</given-names>
</name>
&
<name>
<surname>Lee</surname>
<given-names>A. R.</given-names>
<suffix>III</suffix>
</name>
<article-title>Making phylogenetic sense of biochemical and morphological diversity among the protists</article-title>
in
<source>The Unity of Evolutionary Biology: 4th International Congress of Systematic and Evolutionary Biology</source>
(ed.
<name>
<surname>Dudley</surname>
<given-names>T. R.</given-names>
</name>
)
<volume>Vol. 2</volume>
,
<fpage>432</fpage>
<lpage>441</lpage>
(Dioscorides Press, Portland, Oregon,
<year>1991</year>
).</mixed-citation>
</ref>
<ref id="b52">
<mixed-citation publication-type="other">
<name>
<surname>Felsenstein</surname>
<given-names>J.</given-names>
</name>
PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, Universityy of Washington, Seattle. (
<year>2005</year>
).</mixed-citation>
</ref>
<ref id="b53">
<mixed-citation publication-type="other">
<name>
<surname>Guibas</surname>
<given-names>L. J.</given-names>
</name>
&
<name>
<surname>Sedgewick</surname>
<given-names>R.</given-names>
</name>
A dichromatic framework for balanced trees in
<italic>Proceedings of the 19th Annual Symposium on Foundations of Computer Science</italic>
, 8–21 (Institute of Electrical and Electronics Engineers,
<year>1995</year>
).</mixed-citation>
</ref>
<ref id="b54">
<mixed-citation publication-type="journal">
<name>
<surname>Dalquen</surname>
<given-names>D. A.</given-names>
</name>
,
<name>
<surname>Anisimova</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Gonnet</surname>
<given-names>G. H.</given-names>
</name>
&
<name>
<surname>Dessimoz</surname>
<given-names>C.</given-names>
</name>
<article-title>ALF–a simulation framework for genome evolution</article-title>
.
<source>Mol. Biol. Evol.</source>
<volume>29</volume>
,
<fpage>1115</fpage>
<lpage>1123</lpage>
,
<pub-id pub-id-type="doi">10.1093/molbev/msr268</pub-id>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22160766</pub-id>
</mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn>
<p>
<bold>Author Contributions</bold>
The method was devised and implemented by Y.C. Y.C., Y.-B.C. and M.A.R. designed the experiments, and analysed and interpreted the results. All authors contributed to the writing of the paper, and approved the final manuscript.</p>
</fn>
</fn-group>
</back>
<floats-group>
<fig id="f1">
<label>Figure 1</label>
<caption>
<title>Performance of TF-IDF with variation between groups.</title>
<p>Precision (
<bold>A</bold>
) increases with variation between groups. Recall (
<bold>B</bold>
) is not substantially affected by variation between groups. Variation within groups is 0.01, variation post-LGT is zero, and deletion is zero. Error bars are 2× standard error.</p>
</caption>
<graphic xlink:href="srep30308-f1"></graphic>
</fig>
<fig id="f2">
<label>Figure 2</label>
<caption>
<title>Performance of TF-IDF with variation within groups.</title>
<p>Precision (
<bold>A</bold>
) increases with variation within groups, while recall (
<bold>B</bold>
) is essentially unchanged. Variation between groups is 0.1, variation post-LGT is zero, and deletion is zero. Error bars are 2× standard error.</p>
</caption>
<graphic xlink:href="srep30308-f2"></graphic>
</fig>
<fig id="f3">
<label>Figure 3</label>
<caption>
<title>Performance of TF-IDF with variation post-LGT and deletion.</title>
<p>Precision (
<bold>A</bold>
) decreases with variation post-LGT, but is unaffected by deletion. Recall (
<bold>B</bold>
) decreases greatly with variation post-LGT and slightly with deletion. Variation between groups is 0.1, and variation within groups is 0.01. Sequence length is 300,000 nt.</p>
</caption>
<graphic xlink:href="srep30308-f3"></graphic>
</fig>
<fig id="f4">
<label>Figure 4</label>
<caption>
<title>Performance of ALFY with variation post-LGT.</title>
<p>Precision (
<bold>A</bold>
) and recall (
<bold>B</bold>
) decrease with variation post-LGT. Variation between groups is 0.1, variation within groups is 0.01, and deletion is zero. Error bars are 2× standard error.</p>
</caption>
<graphic xlink:href="srep30308-f4"></graphic>
</fig>
<fig id="f5">
<label>Figure 5</label>
<caption>
<title>Performance of ALFY with deletion.</title>
<p>Precision (
<bold>A</bold>
) is stable with deletion. Recall (
<bold>B</bold>
) decreases with deletion. Variation between groups is 0.1, variation within groups is 0.01, and variation post-LGT is 0.01. Error bars are 2× standard error.</p>
</caption>
<graphic xlink:href="srep30308-f5"></graphic>
</fig>
<fig id="f6">
<label>Figure 6</label>
<caption>
<title>Performance of TF-IDF with
<italic>k</italic>
-mer size.</title>
<p>Precision (
<bold>A</bold>
) increases with
<italic>k</italic>
, while recall (
<bold>B</bold>
) decreases with
<italic>k</italic>
. Error bars are 2× standard error.</p>
</caption>
<graphic xlink:href="srep30308-f6"></graphic>
</fig>
<fig id="f7">
<label>Figure 7</label>
<caption>
<title>Log-log plot of sequence length against walltime.</title>
<p>See text for details.</p>
</caption>
<graphic xlink:href="srep30308-f7"></graphic>
</fig>
<fig id="f8">
<label>Figure 8</label>
<caption>
<title>
<italic>Log U</italic>
against time divided by sequence length.</title>
<p>The slope of the regression line is 0.0002, and the grey area is the 95% confidence interval.</p>
</caption>
<graphic xlink:href="srep30308-f8"></graphic>
</fig>
<fig id="f9">
<label>Figure 9</label>
<caption>
<title>Comparison of TF-IDF and ALFY with an empirical dataset.</title>
<p>Both A and B represent the genome of
<italic>Staphylococcus aureus</italic>
TW20. A shows the result of ALFY analysis
<xref ref-type="bibr" rid="b15">15</xref>
; regions inferred to have been transferred from MRSA252 are represented in black, while regions homologous between TW20 and USA300.TCH15156 are shown in grey. B shows the result of TF-IDF analysis. TF-IDF can infer LGT only from outside the target group, so no region is in grey. Both plots were generated from analysis of the seven
<italic>S. aureus</italic>
genome dataset.</p>
</caption>
<graphic xlink:href="srep30308-f9"></graphic>
</fig>
<fig id="f10">
<label>Figure 10</label>
<caption>
<title>Overview of data simulation.</title>
<p>Step 1: The simulation starts with a single ancestor and generates 16 sequences, which serve as ancestors for each group (variation between groups). Step 2: Within each group we generate 16 descendants (variation within groups), then add LGT events between these groups. Step 3: Finally we simulate variation post-LGT, which may include deletion. From each initial ancestor the simulation generates 256 sequences. Symbols:
<inline-formula id="d33e1388">
<inline-graphic id="d33e1389" xlink:href="srep30308-m6.jpg"></inline-graphic>
</inline-formula>
DNA sequences which are ancestors of the sequence groups.
<inline-formula id="d33e1391">
<inline-graphic id="d33e1392" xlink:href="srep30308-m7.jpg"></inline-graphic>
</inline-formula>
Phylogenetic tree used to generate populations of each group.
<inline-formula id="d33e1394">
<inline-graphic id="d33e1395" xlink:href="srep30308-m8.jpg"></inline-graphic>
</inline-formula>
DNA sequences that constitute groups.
<inline-formula id="d33e1397">
<inline-graphic id="d33e1398" xlink:href="srep30308-m9.jpg"></inline-graphic>
</inline-formula>
LGTs events are added between them. LGT between two sequences.
<inline-formula id="d33e1400">
<inline-graphic id="d33e1401" xlink:href="srep30308-m10.jpg"></inline-graphic>
</inline-formula>
Phylogenetic tree on which the evolutionary process post-LGT is simulated. This process tends to obscure the LGT events. Branch length determines the ‘age’ of the LGT events. Regions of the sequences may be deleted at this step.
<inline-formula id="d33e1404">
<inline-graphic id="d33e1405" xlink:href="srep30308-m11.jpg"></inline-graphic>
</inline-formula>
DNA sequences generated by the simulation.</p>
</caption>
<graphic xlink:href="srep30308-f10"></graphic>
</fig>
<fig id="f11">
<label>Figure 11</label>
<caption>
<title>An example of simulated and inferred LGTs.</title>
<p>The
<italic>x</italic>
-axis displays the nucleotide position, and the
<italic>y</italic>
-axis the sixteen sequences generated in our first (recipient) group. The wide bars show the lateral regions actually simulated, and the narrow black bars the regions inferred as lateral by TF-IDF. Here, variation between groups is 0.1, variation within groups is 0.001, variation post-LGT is 0.01, deletion is zero,
<italic>k</italic>
 = 40 and sequence length is 1000 nt.</p>
</caption>
<graphic xlink:href="srep30308-f11"></graphic>
</fig>
<table-wrap position="float" id="t1">
<label>Table 1</label>
<caption>
<title>Summary of regions in the
<italic>Staphylococcus aureus</italic>
TW20 genome (GenBank NC_017331.1) inferred as lateral by TF-IDF.</title>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="bottom">
<tr>
<th rowspan="2" align="left" valign="top" charoff="50"> </th>
<th colspan="2" align="center" valign="top" charoff="50">2000–3999
<hr></hr>
</th>
<th colspan="2" align="center" valign="top" charoff="50">4000–5999
<hr></hr>
</th>
<th colspan="2" align="center" valign="top" charoff="50">6000+
<hr></hr>
</th>
<th colspan="2" align="center" valign="top" charoff="50">2000+
<hr></hr>
</th>
</tr>
<tr>
<th align="center" valign="top" charoff="50">No.</th>
<th align="center" valign="top" charoff="50">%</th>
<th align="center" valign="top" charoff="50">No.</th>
<th align="center" valign="top" charoff="50">%</th>
<th align="center" valign="top" charoff="50">No.</th>
<th align="center" valign="top" charoff="50">%</th>
<th align="center" valign="top" charoff="50">No.</th>
<th align="center" valign="top" charoff="50">%</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="top" charoff="50">LGT regions</td>
<td align="center" valign="top" charoff="50">121</td>
<td align="center" valign="top" charoff="50">8.5</td>
<td align="center" valign="top" charoff="50">32</td>
<td align="center" valign="top" charoff="50">2.3</td>
<td align="center" valign="top" charoff="50">20</td>
<td align="center" valign="top" charoff="50">1.4</td>
<td align="center" valign="top" charoff="50">173</td>
<td align="center" valign="top" charoff="50">12.2</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">Mean size (nt)</td>
<td align="center" valign="top" charoff="50">2797</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">4782</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">29600</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">6263</td>
<td align="center" valign="top" charoff="50"></td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">Median size (nt)</td>
<td align="center" valign="top" charoff="50">2786</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">4727</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">10496</td>
<td align="center" valign="top" charoff="50"></td>
<td align="center" valign="top" charoff="50">3112</td>
<td align="center" valign="top" charoff="50"></td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">Nucleotides</td>
<td align="center" valign="top" charoff="50">338413</td>
<td align="center" valign="top" charoff="50">11.1</td>
<td align="center" valign="top" charoff="50">153009</td>
<td align="center" valign="top" charoff="50">5.0</td>
<td align="center" valign="top" charoff="50">592007</td>
<td align="center" valign="top" charoff="50">19.5</td>
<td align="center" valign="top" charoff="50">1083429</td>
<td align="center" valign="top" charoff="50">35.6</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">Proteins
<xref ref-type="fn" rid="t1-fn1">1</xref>
</td>
<td align="center" valign="top" charoff="50">405</td>
<td align="center" valign="top" charoff="50">14.6</td>
<td align="center" valign="top" charoff="50">169</td>
<td align="center" valign="top" charoff="50">6.1</td>
<td align="center" valign="top" charoff="50">515</td>
<td align="center" valign="top" charoff="50">18.5</td>
<td align="center" valign="top" charoff="50">1071</td>
<td align="center" valign="top" charoff="50">39.2</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">Hypothetical proteins</td>
<td align="center" valign="top" charoff="50">116</td>
<td align="center" valign="top" charoff="50">14.3</td>
<td align="center" valign="top" charoff="50">38</td>
<td align="center" valign="top" charoff="50">4.7</td>
<td align="center" valign="top" charoff="50">157</td>
<td align="center" valign="top" charoff="50">19.3</td>
<td align="center" valign="top" charoff="50">311</td>
<td align="center" valign="top" charoff="50">38.3</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="t1-fn1">
<p>Numbers in the top row refer to the length ranges of segments selected for analysis.</p>
</fn>
<fn id="t1-fn2">
<p>
<sup>1</sup>
Protein-coding genes fully or partially contained within a region inferred as lateral by TF-IDF.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000146 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000146 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4958984
   |texte=   A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27453035" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021