MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis

Internal identifier: 000261 (Pmc/Corpus); previous: 000260; next: 000262

Authors: Veronika B. Dubinkina; Dmitry S. Ischenko; Vladimir I. Ulyantsev; Alexander V. Tyakht; Dmitry G. Alexeev

Source:

RBID : PMC:4715287

Abstract

Background

A rapidly increasing flow of genomic data requires efficient methods for obtaining a compact representation of it. Feature extraction facilitates classification, clustering and model analysis for testing and refining biological hypotheses. A “shotgun” metagenome is an analytically challenging type of genomic data: it contains sequences of all genes from the totality of a complex microbial community. Recently, researchers have started to analyze metagenomes using reference-free methods based on the oligonucleotide (k-mer) frequency spectrum, an approach previously applied to isolated genomes. However, little is known about how these methods correlate with existing approaches to metagenomic feature extraction, or about the limits of their applicability. Here we evaluated a metagenomic pairwise dissimilarity measure based on the short k-mer spectrum, using human gut microbiota, a biomedically significant object of study, as an example.

Results

We developed a method for calculating pairwise dissimilarity (beta-diversity) of “shotgun” metagenomes based on short k-mer spectra (5≤k≤11). The method was validated on simulated metagenomes and then applied to a large collection of human gut metagenomes from populations around the world (n=281). The k-mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but differently from one using a genome catalog. This difference turned out to be associated with a significant presence of viral reads in a number of metagenomes. Simulations showed that bacterial genetic variability and sequencing errors have a limited impact on k-mer spectra. Specific differences between the datasets from individual populations were identified.
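
As a rough illustration of the idea in the Results paragraph, the sketch below builds a normalized k-mer spectrum for each set of reads and compares two spectra. It is a minimal sketch and not the authors' implementation: the abstract does not name the dissimilarity measure, so Bray-Curtis is used here as an assumption, and merging each k-mer with its reverse complement is likewise an assumption. The function names `canonical`, `kmer_spectrum` and `bray_curtis` are hypothetical.

```python
from collections import Counter

# Translation table used to build reverse complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: strands are merged, since shotgun reads come from both strands.
    """
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Count k-mers over all reads and normalize the counts to frequencies."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two frequency spectra (0 = identical, 1 = disjoint).

    Assumption: the choice of Bray-Curtis is illustrative only.
    """
    keys = set(p) | set(q)
    numerator = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    denominator = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy reads; real metagenomes contain millions of reads and would use 5 <= k <= 11.
    sample_a = ["ACGTACGTGG", "TTGACCA"]
    sample_b = ["ACGTACGTGG", "GGGCCCAAA"]
    k = 3
    d = bray_curtis(kmer_spectrum(sample_a, k), kmer_spectrum(sample_b, k))
    print(f"dissimilarity at k={k}: {d:.3f}")
```

With short k the spectrum has at most 4^k entries, so a dictionary is sufficient here; at larger scale a dedicated k-mer counter would probably be preferable to this pure-Python loop.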

Conclusions

Our approach allows rapid estimation of pairwise dissimilarity between metagenomes. Although we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even those with novel microbiota. A dissimilarity measure based on the k-mer spectrum provides a wider perspective than measures based on alignment against reference sequence sets. At the early stages of bioinformatic analysis, it helps avoid missing outstanding features of metagenomic composition, particularly those related to the presence of an unknown bacterium, virus or eukaryote, as well as technical artifacts (sample contamination, reads of non-biological origin, etc.). Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0875-7) contains supplementary material, which is available to authorized users.


Url:
DOI: 10.1186/s12859-015-0875-7
PubMed: 26774270
PubMed Central: 4715287

Links to Exploration step

PMC:4715287

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Assessment of
<italic>k</italic>
-mer spectrum applicability for metagenomic dissimilarity analysis</title>
<author>
<name sortKey="Dubinkina, Veronika B" sort="Dubinkina, Veronika B" uniqKey="Dubinkina V" first="Veronika B." last="Dubinkina">Veronika B. Dubinkina</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ischenko, Dmitry S" sort="Ischenko, Dmitry S" uniqKey="Ischenko D" first="Dmitry S." last="Ischenko">Dmitry S. Ischenko</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ulyantsev, Vladimir I" sort="Ulyantsev, Vladimir I" uniqKey="Ulyantsev V" first="Vladimir I." last="Ulyantsev">Vladimir I. Ulyantsev</name>
<affiliation>
<nlm:aff id="Aff3">ITMO University, Kronverksky Pr., St. Petersburg, 197101 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tyakht, Alexander V" sort="Tyakht, Alexander V" uniqKey="Tyakht A" first="Alexander V." last="Tyakht">Alexander V. Tyakht</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Alexeev, Dmitry G" sort="Alexeev, Dmitry G" uniqKey="Alexeev D" first="Dmitry G." last="Alexeev">Dmitry G. Alexeev</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26774270</idno>
<idno type="pmc">4715287</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4715287</idno>
<idno type="RBID">PMC:4715287</idno>
<idno type="doi">10.1186/s12859-015-0875-7</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000261</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000261</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Assessment of
<italic>k</italic>
-mer spectrum applicability for metagenomic dissimilarity analysis</title>
<author>
<name sortKey="Dubinkina, Veronika B" sort="Dubinkina, Veronika B" uniqKey="Dubinkina V" first="Veronika B." last="Dubinkina">Veronika B. Dubinkina</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ischenko, Dmitry S" sort="Ischenko, Dmitry S" uniqKey="Ischenko D" first="Dmitry S." last="Ischenko">Dmitry S. Ischenko</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ulyantsev, Vladimir I" sort="Ulyantsev, Vladimir I" uniqKey="Ulyantsev V" first="Vladimir I." last="Ulyantsev">Vladimir I. Ulyantsev</name>
<affiliation>
<nlm:aff id="Aff3">ITMO University, Kronverksky Pr., St. Petersburg, 197101 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tyakht, Alexander V" sort="Tyakht, Alexander V" uniqKey="Tyakht A" first="Alexander V." last="Tyakht">Alexander V. Tyakht</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Alexeev, Dmitry G" sort="Alexeev, Dmitry G" uniqKey="Alexeev D" first="Dmitry G." last="Alexeev">Dmitry G. Alexeev</name>
<affiliation>
<nlm:aff id="Aff1">Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>A rapidly increasing flow of genomic data requires efficient methods for obtaining a compact representation of it. Feature extraction facilitates classification, clustering and model analysis for testing and refining biological hypotheses. A “shotgun” metagenome is an analytically challenging type of genomic data: it contains sequences of all genes from the totality of a complex microbial community. Recently, researchers have started to analyze metagenomes using reference-free methods based on the oligonucleotide (
<italic>k-</italic>
mer) frequency spectrum, an approach previously applied to isolated genomes. However, little is known about how these methods correlate with existing approaches to metagenomic feature extraction, or about the limits of their applicability. Here we evaluated a metagenomic pairwise dissimilarity measure based on the short
<italic>k-</italic>
mer spectrum, using human gut microbiota, a biomedically significant object of study, as an example.</p>
</sec>
<sec>
<title>Results</title>
<p>We developed a method for calculating pairwise dissimilarity (beta-diversity) of “shotgun” metagenomes based on short
<italic>k-</italic>
mer spectra (5≤
<italic>k</italic>
≤11). The method was validated on simulated metagenomes and then applied to a large collection of human gut metagenomes from populations around the world (
<italic>n</italic>
=281). The
<italic>k-</italic>
mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but differently from one using a genome catalog. This difference turned out to be associated with a significant presence of viral reads in a number of metagenomes. Simulations showed that bacterial genetic variability and sequencing errors have a limited impact on
<italic>k-</italic>
mer spectra. Specific differences between the datasets from individual populations were identified.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our approach allows rapid estimation of pairwise dissimilarity between metagenomes. Although we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even those with novel microbiota. A dissimilarity measure based on the
<italic>k-</italic>
mer spectrum provides a wider perspective than measures based on alignment against reference sequence sets. At the early stages of bioinformatic analysis, it helps avoid missing outstanding features of metagenomic composition, particularly those related to the presence of an unknown bacterium, virus or eukaryote, as well as technical artifacts (sample contamination, reads of non-biological origin, etc.). Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-015-0875-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26774270</article-id>
<article-id pub-id-type="pmc">4715287</article-id>
<article-id pub-id-type="publisher-id">875</article-id>
<article-id pub-id-type="doi">10.1186/s12859-015-0875-7</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Assessment of
<italic>k</italic>
-mer spectrum applicability for metagenomic dissimilarity analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Dubinkina</surname>
<given-names>Veronika B.</given-names>
</name>
<address>
<email>dubinkina@phystech.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ischenko</surname>
<given-names>Dmitry S.</given-names>
</name>
<address>
<email>ischenko.dmitry@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ulyantsev</surname>
<given-names>Vladimir I.</given-names>
</name>
<address>
<email>vl.ulyantsev@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tyakht</surname>
<given-names>Alexander V.</given-names>
</name>
<address>
<email>at@niifhm.ru</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Alexeev</surname>
<given-names>Dmitry G.</given-names>
</name>
<address>
<email>alexeev@niifhm.ru</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Research Institute of Physico-Chemical Medicine, Malaya Pirogovskaya, Moscow, 119435 Russia</aff>
<aff id="Aff2">
<label></label>
Moscow Institute of Physics and Technology (State University), Institutskiy per., Dolgoprudny, 141700 Russia</aff>
<aff id="Aff3">
<label></label>
ITMO University, Kronverksky Pr., St. Petersburg, 197101 Russia</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>16</day>
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>16</day>
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>17</volume>
<elocation-id>38</elocation-id>
<history>
<date date-type="received">
<day>7</day>
<month>9</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>12</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© Dubinkina et al. 2016</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>A rapidly increasing flow of genomic data requires the development of efficient methods for obtaining its compact representation. Feature extraction facilitates classification, clustering and model analysis for testing and refining biological hypotheses. A “shotgun” metagenome is an analytically challenging type of genomic data: it contains sequences of all genes from the totality of a complex microbial community. Recently, researchers started to analyze metagenomes using reference-free methods based on the analysis of oligonucleotides (
<italic>k-</italic>
mers) frequency spectrum previously applied to isolated genomes. However, little is known about their correlation with the existing approaches for metagenomic feature extraction, as well as the limits of applicability. Here we evaluated a metagenomic pairwise dissimilarity measure based on short
<italic>k-</italic>
mer spectrum using the example of human gut microbiota, a biomedically significant object of study.</p>
</sec>
<sec>
<title>Results</title>
<p>We developed a method for calculating pairwise dissimilarity (beta-diversity) of “shotgun” metagenomes based on short
<italic>k-</italic>
mer spectra (5≤
<italic>k</italic>
≤11). The method was validated on simulated metagenomes and further applied to a large collection of human gut metagenomes from the populations of the world (
<italic>n</italic>
=281). The
<italic>k-</italic>
mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but different from one using a genome catalog. This difference turned out to be associated with a significant presence of viral reads in a number of metagenomes. Simulations showed limited impact of bacterial genetic variability as well as sequencing errors on
<italic>k-</italic>
mer spectra. Specific differences between the datasets from individual populations were identified.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Our approach allows rapid estimation of pairwise dissimilarity between metagenomes. Though we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even metagenomes with novel microbiota. Dissimilarity measure based on
<italic>k-</italic>
mer spectrum provides a wider perspective than measures based on alignment against reference sequence sets. It helps to detect unusual features of metagenomic composition, particularly those related to the presence of an unknown bacterium, virus or eukaryote, as well as technical artifacts (sample contamination, reads of non-biological origin, etc.), at the early stages of bioinformatic analysis. Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s12859-015-0875-7) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>
<italic>k-</italic>
mer</kwd>
<kwd>
<italic>n-</italic>
gram</kwd>
<kwd>
<italic>l-</italic>
tuple</kwd>
<kwd>Sequence signature</kwd>
<kwd>Gut metagenome</kwd>
<kwd>Phage</kwd>
<kwd>Reference-free metagenomic analysis</kwd>
<kwd>Genomic variability</kwd>
<kwd>Mapping bias</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100003443</institution-id>
<institution>Ministry of Education and Science of the Russian Federation (RU)</institution>
</institution-wrap>
</funding-source>
<award-id>RFMEFI60414X0119</award-id>
<principal-award-recipient>
<name>
<surname>Alexeev</surname>
<given-names>Dmitry G.</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group>
<funding-source>
<institution>Government of Russian Federation</institution>
</funding-source>
<award-id>Grant 074-U01</award-id>
<principal-award-recipient>
<name>
<surname>Ulyantsev</surname>
<given-names>Vladimir I.</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group>
<funding-source>
<institution>Russian Scientific Foundation</institution>
</funding-source>
<award-id>project #15-14-00066</award-id>
<principal-award-recipient>
<name>
<surname>Tyakht</surname>
<given-names>Alexander V.</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100003443</institution-id>
<institution>Ministry of Education and Science of the Russian Federation (RU)</institution>
</institution-wrap>
</funding-source>
<award-id>RFMEFI60414X0119</award-id>
<principal-award-recipient>
<name>
<surname>Dubinkina</surname>
<given-names>Veronika B.</given-names>
</name>
</principal-award-recipient>
</award-group>
<award-group>
<funding-source>
<institution>Russian Scientific Foundation</institution>
</funding-source>
<award-id>project #15-14-00066</award-id>
<principal-award-recipient>
<name>
<surname>Ischenko</surname>
<given-names>Dmitry S.</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2016</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>During the last decade, metagenomics has become one of the most rapidly developing areas of molecular genomics. The advent of next-generation sequencing made it possible to perform genomic analysis of samples obtained directly from the environment. Such an approach provides data for an extensive quantitative examination of microbial community structure, including uncultivable and previously undiscovered components. The scope of metagenomic analysis has extended from science to heavy industry [
<xref ref-type="bibr" rid="CR1">1</xref>
], agriculture [
<xref ref-type="bibr" rid="CR2">2</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
] and healthcare [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Metagenomic data are accumulating rapidly, which creates a demand for efficient means of analysis [
<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
<p>One of the common steps in metagenomic study is calculation of pairwise dissimilarity between the samples (beta-diversity) [
<xref ref-type="bibr" rid="CR6">6</xref>
]. Beta-diversity is a quantitative measure of the differences in composition between two microbial communities. Its value is calculated from features like taxonomic or functional composition, phylogenetic structure of the whole community, etc. A dissimilarity matrix composed of pairwise distances between all samples is used for further cluster analysis, classification and study of influence of the experimental factors. In large-scale studies involving tens and hundreds of metagenomes, critical requirements in beta-diversity analysis include high algorithm performance and low memory usage.</p>
<p>For a long time, the standard approach for evaluating beta-diversity was based on the identification of species in metagenomic samples through 16S rRNA gene sequencing. However, this method has inherent disadvantages, including incompleteness of reference databases, presence of multiple copies of the 16S rRNA gene in the same genome, discrepancy between phylogenetic trees constructed using 16S rRNA and other genes, and a lack of information about the other genes and, consequently, about the metabolic potential of the studied community. An alternative, more informative method is whole-genome sequencing (WGS, “shotgun”), which generates millions of reads from the total DNA of all organisms inhabiting the environment. The identification of organisms in a short-read WGS metagenome is commonly based on alignment or
<italic>de novo</italic>
assembly [
<xref ref-type="bibr" rid="CR7">7</xref>
]. The alignment method is a comparison between sequences of obtained reads and sequences of reference genes or genomes, and has significant drawbacks such as high computational costs and incomplete databases.
<italic>De novo</italic>
assembly is usually a time-consuming task for such complex data as metagenomes that may contain many unknown or highly similar genomes of organisms with widely varying abundance.</p>
<p>With a rapid increase in data output produced by sequencing technologies, efficient methods for genomic analysis based on
<italic>k-</italic>
mer composition analysis emerged. Such algorithms work with
<italic>k-</italic>
mers (oligonucleotide sequences of length
<italic>k</italic>
, also called
<italic>l-</italic>
tuples or
<italic>n-</italic>
grams) obtained directly from metagenomic reads, without pre-mapping or assembly.</p>
<p>In comparison with reference-based methods, the main advantages of
<italic>k-</italic>
mer based approaches are the compressed representation of sequences and the inclusion of the entire data volume in the analysis (unlike alignment, where only the reads successfully mapped to a reference database influence the result). Among these methods, the simplest and most effective for exploratory analysis of large data sets is the comparison of sequences by calculating pairwise dissimilarity between them on the basis of the
<italic>k-</italic>
mer spectrum, a normalized vector of the occurrence frequencies of each
<italic>k-</italic>
mer in the metagenome. The
<italic>k-</italic>
mer length is a key factor influencing specificity and efficiency of the comparison. For different intervals of
<italic>k</italic>
, the respective algorithms have been designed that target different specific tasks. For example, for very short
<italic>k-</italic>
mers (
<italic>k</italic>
=4−7) only “rough” estimates are possible: sequence quality check [
<xref ref-type="bibr" rid="CR8">8</xref>
,
<xref ref-type="bibr" rid="CR9">9</xref>
], taxonomic separation of individual genomes [
<xref ref-type="bibr" rid="CR10">10</xref>
–
<xref ref-type="bibr" rid="CR12">12</xref>
] or comparison of metagenomic communities with notably different composition [
<xref ref-type="bibr" rid="CR13">13</xref>
–
<xref ref-type="bibr" rid="CR16">16</xref>
]. For
<italic>k</italic>
=15−30, the computational costs associated with processing the whole spectrum increase significantly. Two approaches can be applied to reduce them. The first is to select a fraction of
<italic>k-</italic>
mers that describe the studied data in the most complete way (feature extraction) [
<xref ref-type="bibr" rid="CR17">17</xref>
]. Another way is to combine multiple
<italic>k-</italic>
mers into one feature using a certain principle (different approaches of
<italic>k-</italic>
mer binning) [
<xref ref-type="bibr" rid="CR18">18</xref>
,
<xref ref-type="bibr" rid="CR19">19</xref>
]. In the intermediate range (
<italic>k</italic>
=7−12), it is still computationally feasible to analyze the complete set of
<italic>k-</italic>
mers, and the
<italic>k-</italic>
mer length remains sufficiently specific [
<xref ref-type="bibr" rid="CR12">12</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
] for comparing distinct genomes.</p>
<p>Among the metagenomic
<italic>k-</italic>
mer methods, the simplest and most effective for exploratory analysis of large data sets is the comparison of sequences by calculating pairwise distances between them on the basis of
<italic>k-</italic>
mer spectra [
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
]. In this area, researchers are actively designing fast algorithms for calculating and assessing the
<italic>k-</italic>
mer spectrum [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR24">24</xref>
]. Most studies are focused on examining clustering of the samples by one or more factors (geography, nutrition, clinical status, etc.) [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
<p>Since the prevalent approaches for assessment of beta-diversity today are reference-based, an important question is how their results correlate with the
<italic>k-</italic>
mer methods. In this study, we compared common reference-based methods (based on taxonomic and gene composition, including phylogeny-aware methods) with the
<italic>k-</italic>
mer approach. We explored how various characteristics of the data influence the results of
<italic>k-</italic>
mer spectra analysis and identified the advantages of
<italic>k-</italic>
mer analysis compared with the reference-based approaches. To evaluate the applicability of
<italic>k-</italic>
mer-based dissimilarity, we selected the metagenome of human gut microbiota, the study of which has great biomedical importance and promise. Although intestinal microbiota is currently one of the most studied complex microbial communities, many of its components are still not fully identified (among them uncultured bacteria, phages, fungi and protozoa) [
<xref ref-type="bibr" rid="CR25">25</xref>
]. The application of our method revealed a significant presence of one such component, a phage, that went undetected by reference-based methods.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>Simulated metagenomes</title>
<p>Two sets of “shotgun” gut metagenomes were simulated using MetaSim [
<xref ref-type="bibr" rid="CR26">26</xref>
]. The high-diversity set included 100 metagenomes generated from the genomes of ten distantly related major bacterial species accounting for more than 90 % of all reads in Chinese group:
<italic>Akkermansia muciniphila</italic>
ATCC BAA-835,
<italic>Alistipes shahii</italic>
WAL 8301,
<italic>Bacteroides vulgatus</italic>
ATCC 8482,
<italic>Bifidobacterium adolescentis</italic>
ATCC 15703,
<italic>Coprococcus sp.</italic>
ART55/1,
<italic>Eubacterium eligens</italic>
ATCC 27750,
<italic>Faecalibacterium prausnitzii</italic>
L2-6,
<italic>Lachnospiraceae bacterium</italic>
1 4 56FAA,
<italic>Prevotella copri</italic>
DSM 18205 and
<italic>Ruminococcus sp.</italic>
18P13.</p>
<p>The simulation included the following steps. First, for each genome, the mean and standard deviation of its relative abundance were estimated from the taxonomic composition of the Chinese metagenomes. For each metagenome, ten abundance values were drawn from normal distributions with these parameters, and the obtained values were normalized to 1 million reads; a total of 100 genera abundance vectors were obtained (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S1). The metagenomes were generated by mixing the ten bacterial genomes at the obtained abundance levels and sampling short reads from the genomes using MetaSim with a read length of 100 bp. We also sampled these reads with errors (1 % probability of error at each base).</p>
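<p>The abundance-sampling step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name simulate_abundances, the clipping of negative draws to zero and the rounding to whole read counts are assumptions introduced here. The text only states that values were drawn from a normal distribution and normalized to 1 million reads.</p>
<preformat><![CDATA[
```python
import random


def simulate_abundances(means, sds, total_reads=1_000_000, rng=None):
    """Draw one abundance per genome from N(mean, sd) and scale to read counts.

    Hypothetical helper: clipping negative draws to zero and rounding to
    integers are assumptions, not details given in the article.
    """
    rng = rng or random.Random()
    # One normal draw per genome; negative values are clipped to zero.
    draws = [max(0.0, rng.gauss(m, s)) for m, s in zip(means, sds)]
    total = sum(draws)
    if total == 0:
        raise ValueError("all sampled abundances are zero")
    # Rescale so that the counts sum to (approximately) total_reads.
    return [round(total_reads * d / total) for d in draws]
```
]]></preformat>
<p>Calling this function 100 times with the per-genome means and standard deviations would yield 100 abundance vectors, each of which could then be passed to a read simulator.</p>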
<p>The low-diversity simulated group included 100 metagenomes generated in a similar way from the genomes of ten closely related major bacterial species accounting for more than 90 % of all reads in the HMP group:
<italic>Bacteroides vulgatus</italic>
ATCC 8482,
<italic>Bacteroides dorei</italic>
5 1 36/D4,
<italic>Bacteroides uniformis</italic>
ATCC 8492,
<italic>Bacteroides stercoris</italic>
ATCC 43183,
<italic>Bacteroides caccae</italic>
ATCC 43185,
<italic>Bacteroides ovatus</italic>
(strains SD CMC, ATCC 8483 and 3 8 47FAA),
<italic>Bacteroides xylanisolvens</italic>
XB1A and
<italic>Bacteroides thetaiotaomicron</italic>
VPI-5482. Bacterial proportions for these simulations are listed in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S2.</p>
<p>For single nucleotide polymorphism (SNP) simulations, the same ten reference genomes and abundance values as in the high-diversity dataset were used. Two different models of SNPs introduction were used: “independent” and “phylogenetic”.</p>
<p>In the “independent” SNP model, 64 “mutated” genomes were generated for each reference species by substituting nucleotides at random positions independently, at a 0.5 % substitution rate. Thus, the average proportion of SNPs between any two of the “mutated” genomes was ∼1 %.</p>
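<p>As a rough check of the ∼1 % figure, assume (this is not stated in the article) that each site is substituted independently with probability p = 0.005 and that a substituted base is replaced by one of the three other bases with equal probability. A site then differs between two independently mutated genomes if exactly one copy was changed, or if both were changed to different bases:</p>
<preformat><![CDATA[
```latex
\begin{align*}
d &= 2p(1-p) + \tfrac{2}{3}p^{2} = 2p - \tfrac{4}{3}p^{2},\\
d\big|_{p=0.005} &= 0.01 - \tfrac{4}{3}\cdot 2.5\times 10^{-5} \approx 0.00997 \approx 1\,\%.
\end{align*}
```
]]></preformat>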
<p>In the “phylogenetic” SNP model, the procedure was performed in iterations for each reference genome:</p>
<p>
<list list-type="alpha-lower">
<list-item>
<p>Initialize with a single genome; iteration number = 1.</p>
</list-item>
<list-item>
<p>Make a copy of each of the genomes available at the step.</p>
</list-item>
<list-item>
<p>Introduce SNPs to all genomes at random positions.</p>
</list-item>
<list-item>
<p>Increment iteration number.</p>
</list-item>
<list-item>
<p>If the iteration number is greater than 6, stop; else return to step b.</p>
</list-item>
</list>
</p>
<p>After the 6 iterations, 2
<sup>6</sup>
=64 “mutated” genomes are obtained.</p>
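<p>The iterative copy-and-mutate procedure can be sketched in Python as follows. This is an illustration under stated assumptions: the names mutate and phylogenetic_variants are hypothetical, and the per-iteration substitution rate is a free parameter here because the text does not give its value.</p>
<preformat><![CDATA[
```python
import random

BASES = "ACGT"


def mutate(genome, rate, rng):
    """Return a copy of genome with each base substituted with probability rate."""
    out = []
    for base in genome:
        if rng.random() < rate:
            # Replace with a different base, chosen uniformly (an assumption).
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def phylogenetic_variants(reference, rate_per_iteration, iterations=6, seed=0):
    """Double the genome set and mutate every genome, `iterations` times."""
    rng = random.Random(seed)
    genomes = [reference]                      # step a: start from one genome
    for _ in range(iterations):
        genomes = genomes + list(genomes)      # step b: copy every genome
        genomes = [mutate(g, rate_per_iteration, rng) for g in genomes]  # step c
    return genomes                             # 2**iterations genomes
```
]]></preformat>
<p>With the default of six iterations the function returns 64 genomes whose pairwise divergence reflects the depth of their last shared ancestor in the doubling tree.</p>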
<p>In each model, the random “mutated” genomes of the corresponding bacteria were used to generate metagenomes in the same way as for the high-diversity simulation above.</p>
</sec>
<sec id="Sec4">
<title>Real metagenomic datasets</title>
<p>Two “shotgun” gut metagenomic datasets were analyzed: 129 metagenomes of healthy USA population [
<xref ref-type="bibr" rid="CR27">27</xref>
] (referred to as HMP, Illumina platform, read length 101 bp) and 152 metagenomes of Chinese population [
<xref ref-type="bibr" rid="CR28">28</xref>
] including healthy and type 2 diabetes individuals (referred to as China, Illumina platform, read length 90 bp). For each sample, the reads were filtered by quality using FASTQ Quality Filter script from FASTX-Toolkit [
<xref ref-type="bibr" rid="CR29">29</xref>
] (threshold
<italic>QV</italic>
≥30 for each nucleotide in a read). For each metagenome, 1 million high-quality reads were sampled using the random_records script from [
<xref ref-type="bibr" rid="CR30">30</xref>
]. Comparison of various sampling sizes showed that the selected size of subsampling does not significantly affect the results of the measures’ comparison (see Additional file
<xref rid="MOESM2" ref-type="media">2</xref>
: Figure S5).</p>
</sec>
<sec id="Sec5">
<title>Calculation of
<italic>k-</italic>
mer vectors and dissimilarity measures</title>
<p>For each metagenome,
<italic>k-</italic>
mer spectrum was calculated using an
<italic>ad hoc</italic>
Java program that processes FASTA files read-wise by obtaining
<italic>k-</italic>
mer counts for each read and adding the counts to a global array (the value of
<italic>k</italic>
is limited to 15 due to memory consumption). After processing all reads, the counts for reverse-complementary
<italic>k-</italic>
mers were summed and normalized to a sum of 1. The length of the final feature vector (spectrum) did not exceed
<italic>n</italic>
=2
<sup>2
<italic>k</italic>
−1</sup>
for odd
<italic>k</italic>
and
<italic>n</italic>
=2
<sup>2
<italic>k</italic>
−1</sup>
+2
<sup>
<italic>k</italic>
−1</sup>
for even
<italic>k</italic>
because of reverse-complement
<italic>k-</italic>
mers.</p>
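<p>The spectrum calculation can be sketched in Python as follows. This is not the authors' Java program: it uses a dictionary rather than a global array, and the function names, the handling of ambiguous bases (windows containing characters other than A, C, G, T are skipped) and the choice of the lexicographically smaller strand as the representative of a reverse-complement pair are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_spectrum(reads, k):
    """Count canonical k-mers read by read and normalize the counts to a sum of 1."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if VALID.issuperset(kmer):        # skip windows with ambiguous bases
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def spectrum_length(k):
    """Maximum number of canonical k-mers: 2^(2k-1), plus 2^(k-1) for even k."""
    n = 2 ** (2 * k - 1)
    if k % 2 == 0:
        n += 2 ** (k - 1)                     # reverse-complement palindromes
    return n
```
]]></preformat>
<p>The extra term for even k arises because a k-mer can equal its own reverse complement only when k is even; such k-mers are not merged with a partner.</p>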
<p>The obtained
<italic>k-</italic>
mer spectra were used to calculate pairwise dissimilarity via Bray-Curtis measure defined as:
<disp-formula id="Equ1">
<label>(1)</label>
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\begin{array}{@{}rcl@{}} BC(x,y)=1-\frac{2\sum\limits_{i=1}^{4^{k}}\min(m_{i}(x),m_{i}(y))}{\sum\limits_{i=1}^{4^{k}}(m_{i}(x)+m_{i}(y))} \end{array} $$ \end{document}</tex-math>
<mml:math id="M2">
<mml:mtable class="eqnarray" columnalign="left center right">
<mml:mtr>
<mml:mtd class="eqnarray-1">
<mml:mtext mathvariant="italic">BC</mml:mtext>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mo>min</mml:mo>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2015_875_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p>where m is the vector of
<italic>k-</italic>
mer frequencies normalized to a sum of 1 per metagenome and x, y are two different metagenomes. BC = 0, if the frequencies are equal for all
<italic>k-</italic>
mers between the metagenomes, and BC = 1, if no common
<italic>k-</italic>
mers are present in the metagenomes.</p>
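<p>Equation (1) can be written as a short Python function. This is a minimal sketch that assumes each spectrum is stored as a dictionary mapping k-mers to frequencies, with absent k-mers treated as zero; the name bray_curtis is introduced here for illustration.</p>
<preformat><![CDATA[
```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two k-mer spectra given as dicts."""
    keys = set(x) | set(y)
    # Numerator of Eq. (1): sum of element-wise minima over all k-mers.
    shared = sum(min(x.get(key, 0.0), y.get(key, 0.0)) for key in keys)
    # Denominator of Eq. (1): sum of all frequencies in both spectra.
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        raise ValueError("both spectra are empty")
    return 1.0 - 2.0 * shared / total
```
]]></preformat>
<p>For spectra normalized to a sum of 1 the denominator equals 2, so the value reduces to one minus the sum of the element-wise minima.</p>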
</sec>
<sec id="Sec6">
<title>Beta-diversity analysis using reference-based methods</title>
<p>Taxonomic profiling via mapping of metagenomes to a reference genome catalog and coverage analysis was performed as described previously [
<xref ref-type="bibr" rid="CR31">31</xref>
], with the only difference: a non-redundant set of 353 genomes of gut microbes was used (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S3). The final feature vector for each metagenome comprised the relative abundances of microbial species, normalized to a sum of 100 % (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Tables S4 and S5). Dissimilarity was calculated using these vectors both with Bray-Curtis measure and whole-genome adaptation of the weighted UniFrac metric [
<xref ref-type="bibr" rid="CR31">31</xref>
]. Functional profiling was performed as described previously [
<xref ref-type="bibr" rid="CR31">31</xref>
] to yield COG (Clusters of Orthologous Groups) [
<xref ref-type="bibr" rid="CR32">32</xref>
] relative abundance vectors subsequently used for dissimilarity analysis using Bray-Curtis measure (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S6).</p>
<p>An alternative method of taxonomic profiling employed MetaPhlAn v1.7.7 (parameters: -t rel_ab --tax_lev s(g)); here mapping was performed using Bowtie2 v2.0.2 software [
<xref ref-type="bibr" rid="CR33">33</xref>
], up to 3 mismatches per read were allowed (mapping results and statistics are in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S7). All reference-based methods are summarized in Table
<xref rid="Tab1" ref-type="table">1</xref>
.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Types of reference-based analyses used in the study</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Type of reference-based analysis</th>
<th align="left">Method</th>
<th align="left">Beta-diversity measure</th>
<th align="left">Designation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Taxonomic profiling</td>
<td align="left">Mapping to a reference catalog of 353 genomes of intestinal microbiota [
<xref ref-type="bibr" rid="CR31">31</xref>
]</td>
<td align="left">Bray-Curtis</td>
<td align="left">BC TAX (org), (genus)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">Whole-genome version of weighted UniFrac</td>
<td align="left">WG UniFrac</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Quantitative profiling of unique clade-specific marker genes (MetaPhlAn) [
<xref ref-type="bibr" rid="CR48">48</xref>
]</td>
<td align="left">Bray-Curtis</td>
<td align="left">BC MetaPhlAn</td>
</tr>
<tr>
<td align="left">Functional profiling</td>
<td align="left">Mapping to MetaHIT 3.9M catalog of genes [
<xref ref-type="bibr" rid="CR49">49</xref>
] grouped by COGs</td>
<td align="left">Bray-Curtis</td>
<td align="left">BC COG</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec7">
<title>CrAssphage abundance analysis</title>
<p>All crAssphage genes (GenBank: JQ995537.1) [
<xref ref-type="bibr" rid="CR34">34</xref>
] were aligned to the reference gene catalog (similarity criterion:
<italic>e</italic>-value &lt;1E−5, percent of identity &lt;80 %, alignment length/query length >0.8, alignment length/subject length >0.8). For each gene, its relative abundance was estimated as the ratio of the total length of the reads mapped to this gene to the total length of the reads mapped to the reference gene catalog. Phage relative abundance was determined as the sum of the relative abundance values of its genes (Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S8). As an additional method of metagenomic classification, DIAMOND aligner [
<xref ref-type="bibr" rid="CR35">35</xref>
] was used (method: BLASTx against nr database with default parameters) in combination with MEGAN classifier [
<xref ref-type="bibr" rid="CR36">36</xref>
].</p>
</sec>
<sec id="Sec8">
<title>Statistical analysis</title>
<p>Statistical analysis was implemented in R. The code is available at:</p>
<p>
<ext-link ext-link-type="uri" xlink:href="http://download.ripcm.com/Dubinkina_2015_suppl_data/">http://download.ripcm.com/Dubinkina_2015_suppl_data/</ext-link>
and
<ext-link ext-link-type="uri" xlink:href="https://github.com/Zireae1/kmer_project/">https://github.com/Zireae1/kmer_project/</ext-link>
.</p>
</sec>
<sec id="Sec9">
<title>Ethical approval</title>
<p>The sampling procedure was approved by the Ethical Committee for Clinical Research from the Peking University Shenzhen Hospital, Shenzhen Second People’s Hospital and Medical Research Center of Guangdong General Hospital (as reported in [
<xref ref-type="bibr" rid="CR28">28</xref>
]), and enrollments were approved by the Institutional Review Boards of the two recruitment centers (Baylor College of Medicine, Houston, TX and Washington University, St. Louis, MO) (as reported in [
<xref ref-type="bibr" rid="CR27">27</xref>
]).</p>
</sec>
</sec>
<sec id="Sec10" sec-type="results">
<title>Results</title>
<p>To compare the
<italic>k-</italic>
mer based metagenomic beta-diversity measure with traditional reference-based methods we conducted a series of computational experiments on simulated and real data. To validate the
<italic>k-</italic>
mer measure and find an optimal value of
<italic>k</italic>
, we simulated metagenomes from prevalent human gut bacterial genomes. Then the method was applied for the analysis of a group of real human gut metagenomes sequenced in two large-scale projects: China population (n = 152) [
<xref ref-type="bibr" rid="CR28">28</xref>
] and HMP (n = 129) [
<xref ref-type="bibr" rid="CR27">27</xref>
].</p>
<sec id="Sec11">
<title>Comparison of beta-diversity measures for simulated metagenomes</title>
<p>There is considerable variation between
<italic>k-</italic>
mer spectra for genomes of distinct bacterial species due to the differences in the gene content, amino acid coding preferences, etc. [
<xref ref-type="bibr" rid="CR37">37</xref>
]. Presumably, for a metagenome that includes sufficiently covered genomes of multiple species, one should observe substantial agreement between the
<italic>k-</italic>
mer spectrum of the metagenome and its taxonomic composition. To verify this hypothesis, we simulated several datasets with different degrees of community richness and applied the Bray-Curtis measure (a common microbial ecological index) for both taxonomic and
<italic>k-</italic>
mer profiles to compare the two respective dissimilarity matrices (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details).</p>
<sec id="Sec12">
<title>Simulation 1: high-diversity communities</title>
<p>The first synthetic dataset included 100 metagenomes generated by randomly sampling “reads” from ten genomes of common phylogenetically distant human gut bacteria (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details). Comparison of the two approaches showed that, as
<italic>k</italic>
increases, so does the correlation value between the two dissimilarity matrices based on
<italic>k-</italic>
mers and taxonomic composition. With high values of
<italic>k</italic>
, the two matrices are highly similar (e.g. for
<italic>k</italic>
=10, Mantel test: Spearman correlation
<italic>r</italic>
=0.88,
<italic>p</italic>
=0.001, see Additional file
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S1 and Additional file
<xref rid="MOESM4" ref-type="media">4</xref>
: Figure S2).</p>
</sec>
<sec id="Sec13">
<title>Simulation 2: sequencing errors and SNPs</title>
<p>Besides the considerable number of species within the microbiota, other factors contributing to the diversity of metagenomic
<italic>k-</italic>
mers are the presence of mutations and sequencing errors in reads. Therefore, we conducted two experiments, introducing artificial SNPs into genomes and, separately, random single-nucleotide changes (“sequencing errors”) into the reads, in order to explore their influence on the correlation of beta-diversity estimates using
<italic>k-</italic>
mer and taxonomic methods. Datasets from simulation 1 were used here.</p>
<p>To model sequencing errors, the reads for each metagenome were simulated with a per-nucleotide substitution rate of 1 % (a typical order of magnitude for most modern DNA sequencing platforms [
<xref ref-type="bibr" rid="CR38">38</xref>
]). Introduction of such “errors” did not lead to a significant change in correlation between the two methods (from 0.88 to 0.87, for
<italic>k</italic>
=10).</p>
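<p>The error model described above, independent single-nucleotide substitutions at a fixed per-base rate, can be illustrated with a short sketch. The function names below are hypothetical and the code is not the simulator used in the study. The second function gives the expected fraction of <italic>k</italic>-mers left untouched under this model, (1 − rate)<sup><italic>k</italic></sup>, which is about 0.90 for a 1 % rate and <italic>k</italic> = 10.</p>
<preformat>
```python
import random


def add_substitution_errors(read, rate=0.01, rng=None):
    """Return a copy of read where each A/C/G/T base is replaced by a
    different base with probability rate (independent per position)."""
    rng = rng or random.Random()
    bases = "ACGT"
    out = []
    for base in read:
        if base in bases and rate > rng.random():
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def expected_intact_fraction(k, rate):
    """Expected fraction of k-mers containing no substituted base."""
    return (1.0 - rate) ** k
```
</preformat>
<p>Under this simplified model roughly one in ten 10-mers carries an error at a 1 % rate. A plausible reading of the small change in correlation reported above is that such errors are spread nearly uniformly over the spectrum of every sample, although this interpretation is an assumption rather than a result from the source.</p>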
<p>For SNP modeling, bacterial genomes with 1 % of randomly introduced single-nucleotide substitutions (according to an estimate for gut bacteria [
<xref ref-type="bibr" rid="CR39">39</xref>
]) were used to generate simulated metagenomes with the same abundance proportions as in simulation 1. We employed two models of SNP introduction, “independent” and “phylogenetic”. The former is the more straightforward; the “phylogenetic” approach was developed to model the accumulation of mutations in bacterial species during evolution (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details). The results of simulations showed that, independent of the model choice, in general SNPs had a minor effect on
<italic>k-</italic>
mer spectra comparable to the effect of simulated sequencing errors: correlation between the
<italic>k-</italic>
mer and taxonomic methods decreased from 0.931 to 0.929 for “independent” model and 0.927 for “phylogenetic” model (Additional file
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S1A,B). Notably, the introduction of SNPs had a more pronounced effect on metagenomes with highly similar taxonomic composition. This was particularly marked when the SNP rate was increased to 10 % (Additional file
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S1C).</p>
</sec>
<sec id="Sec14">
<title>Simulation 3: low-diversity communities</title>
<p>The second synthetic dataset included 100 metagenomes generated by randomly sampling “reads” from ten genomes of common phylogenetically close human gut bacteria - belonging to the same genus -
<italic>Bacteroides</italic>
(see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details). The correlation between the methods was found to be lower for such a homogeneous community than for a heterogeneous one (for
<italic>k</italic>
= 10, Mantel test: Spearman correlation
<italic>r</italic>
=0.82,
<italic>p</italic>
=0.001). The correlation value tends to increase with
<italic>k</italic>
but does not achieve the level of simulation 1 (for
<italic>k</italic>
=10,
<italic>r</italic>
=0.88, see Additional file
<xref rid="MOESM3" ref-type="media">3</xref>
: Figure S1). This suggests that higher values of
<italic>k</italic>
should be used to increase accuracy; however, the size of the feature vector increases as 4
<sup>
<italic>k</italic>
</sup>
, hence the computational time quickly becomes unacceptable. To select the optimal
<italic>k</italic>
value, we evaluated the correlation between
<italic>k-</italic>
mer and taxonomic dissimilarity matrices together with the computational time of
<italic>k-</italic>
mer matrix generation for
<italic>k</italic>
=5−12 using both high- and low-diversity simulated datasets (see Additional file
<xref rid="MOESM5" ref-type="media">5</xref>
: Figure S3). As the results in both simulations showed, with
<italic>k</italic>
= 11 the dissimilarity matrices are highly correlated while the computational time is still acceptable. On a single computation core, the calculation of the
<italic>k-</italic>
mer spectrum for one sample took about ten seconds; for comparison, the Jellyfish counter [
<xref ref-type="bibr" rid="CR23">23</xref>
] with parameters -m 11 -s 10000 -t 32 (hash size optimized for the value of
<italic>k</italic>
) took about 80 seconds to calculate the spectrum. Further statistical analysis (calculation of the dissimilarity matrix) took about 1-10 minutes (see Additional file
<xref rid="MOESM5" ref-type="media">5</xref>
: Figure S3). At the same time, it is the highest value practically acceptable in terms of memory usage: for
<italic>k</italic>
=11, the spectra occupied ∼4 GB of memory in the R environment, but for
<italic>k</italic>
= 12, as much as 15 GB. Considering these observations, we selected
<italic>k</italic>
=11 for further analyses.</p>
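<p>The growth of the feature vector as 4<sup><italic>k</italic></sup> can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a dense matrix with one 8-byte floating-point value per <italic>k</italic>-mer and 100 samples; both figures are illustrative assumptions, and the overhead of the R environment is not modeled.</p>
<preformat>
```python
def spectrum_matrix_bytes(k, n_samples, bytes_per_value=8):
    """Size in bytes of a dense n_samples x 4**k matrix of k-mer frequencies.

    Assumption: one 8-byte float per (sample, k-mer) cell, no overhead.
    """
    return (4 ** k) * n_samples * bytes_per_value


for k in range(5, 13):
    gib = spectrum_matrix_bytes(k, n_samples=100) / 2 ** 30
    print(f"k={k:2d}  features={4 ** k:>10,d}  dense matrix ~ {gib:8.3f} GiB")
```
</preformat>
<p>Under these assumptions the matrix needs about 3.1 GiB at <italic>k</italic> = 11 and 12.5 GiB at <italic>k</italic> = 12, the same order of magnitude as the memory figures quoted above; each increment of <italic>k</italic> multiplies the requirement by four.</p>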
</sec>
</sec>
<sec id="Sec15">
<title>Comparison of beta-diversity measures on real human gut metagenomes</title>
<p>After testing the method on simulated data, we examined two real human gut datasets from large-scale metagenomic projects: China [
<xref ref-type="bibr" rid="CR28">28</xref>
] and HMP [
<xref ref-type="bibr" rid="CR27">27</xref>
], with the former cohort representing more diverse microbial community structures than the latter [
<xref ref-type="bibr" rid="CR31">31</xref>
]. Using these data, the pairwise dissimilarity matrix obtained by the
<italic>k-</italic>
mer approach with the Bray-Curtis measure (referred to as
<bold>BC kmer</bold>
in the figures and hereafter in the text) was compared with the dissimilarity matrices obtained by each of the four methods based on taxonomic and functional references (see Table
<xref rid="Tab1" ref-type="table">1</xref>
).</p>
<p>To visualize the distributions of beta-diversity values, we applied two types of scatter plots. The first type is a basic principal coordinate analysis (PCoA) plot constructed using a single dissimilarity measure, with dots representing distinct metagenomes (e.g. Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
). On the second type of plot, two dissimilarity measures are compared: each triangle corresponds to a pair of metagenomes, one measure is plotted against the other (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
<xref rid="Fig2" ref-type="fig">a</xref>
, Additional file
<xref rid="MOESM6" ref-type="media">6</xref>
: Figure S4). Samples from the two studies (China and HMP) tended to cluster separately by functional, as well as by
<italic>k-</italic>
mer composition, but not by taxonomic composition (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
,
<xref rid="Fig1" ref-type="fig">b</xref>
,
<xref rid="Fig1" ref-type="fig">c</xref>
,
<xref rid="Fig1" ref-type="fig">d</xref>
). Therefore, the two cohorts were further analyzed separately. Another notable observation was that three of the outliers (all from the HMP group) present on the
<italic>k-</italic>
mer scatter plot were also on the periphery of the COG scatter plot but not of the taxonomic scatter plot (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
,
<xref rid="Fig1" ref-type="fig">b</xref>
; outliers marked with asterisks). These samples were examined in detail.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Variation of metagenomes using different dissimilarity measures. PCoA plots for different dissimilarity measures:
<bold>a</bold>
BC kmer,
<bold>b</bold>
BC COG,
<bold>c</bold>
WG UniFrac,
<bold>d</bold>
BC TAX (org),
<bold>e</bold>
BC MetaPhlAn (org). Three outlier samples are marked with asterisks.
<bold>f</bold>
Heatmap of Spearman correlation coefficient between dissimilarity matrices obtained using different measures (the upper triangle of matrix represents coefficients for China, the lower - for HMP)</p>
</caption>
<graphic xlink:href="12859_2015_875_Fig1_HTML" id="MO1"></graphic>
</fig>
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Comparison of pairwise difference measures obtained by
<italic>k-</italic>
mer and reference-based methods. For each plot, Y-axis represents
<italic>k-</italic>
mer distance, X-axis - distance by one of the reference-based methods. Distribution of dissimilarity measures is shown for
<bold>a</bold>
BC kmer for all reads and BC TAX (org);
<bold>b</bold>
BC kmer for all reads and BC COG;
<bold>c</bold>
BC kmer for reads mapped to the catalog of genomes and BC TAX (org);
<bold>d</bold>
BC kmer for reads mapped to the catalog of genes and BC COG</p>
</caption>
<graphic xlink:href="12859_2015_875_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p>Comparison of the five beta-diversity measures showed that the
<italic>k-</italic>
mer measure was significantly correlated with each of the reference-based ones (Mantel test, Spearman correlation,
<italic>p</italic>
&lt;0.01, Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">f</xref>
). The closest was the measure based on COG composition. For the Chinese group, the correlation values tended to be higher than for the HMP group in all comparisons (Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">f</xref>
). The phylogeny-aware metric WG UniFrac was among the most dissimilar (
<italic>r</italic>
=0.39 for HMP,
<italic>r</italic>
=0.62 for China).</p>
<sec id="Sec16">
<title>Influence of read mappability</title>
<p>To assess the contribution of the unmapped reads to the results,
<italic>k-</italic>
mer spectra were also computed using only the reads that successfully mapped to the corresponding catalog (fraction of mapped reads: for the HMP group, 49 ± 17 % for the genome catalog and 60 ± 5 % for the gene catalog; for China, 49 ± 12 % and 61 ± 6 %, respectively; values are given as median ± s.d.). This analysis yielded two observations (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
). First, the BC TAX org vs. BC kmer correlation became nearly equal between the two cohorts (0.74 for HMP and 0.77 for China). Therefore, the fraction of unmapped reads appears to be one of the major factors contributing to the difference between the cohorts. This parameter depends on the representativeness of the reference catalog and the quality of the sequencing run.</p>
<p>Second, we assessed the shift of each outlier in the direction of the central cloud of points. Quantitatively, for each outlier we calculated the BC kmer difference value: the difference between its BC kmer value and the linearly interpolated middle of the cloud obtained for the same reference-based value (Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
). For the comparison with BC TAX, the BC kmer difference decreased significantly, from 0.31 ± 0.09 to 0.03 ± 0.04 (Wilcoxon test,
<italic>p</italic>
=2.2
<italic>E</italic>
−16). For the comparison with BC COG, the BC kmer difference changed only slightly, from 0.34 ± 0.08 to 0.39 ± 0.07. Correspondingly, the group of outlier pairs moved into the central cloud of points in the BC TAX org vs. BC kmer comparison but did not visibly change location in the BC COG vs. BC kmer comparison.</p>
<p>This observation is in agreement with the fact that the gene reference catalog is more complete than the genome reference catalog and the percentage of reads mapping to the gene catalog is higher (49 ± 17 % vs. 60 ± 5 % for HMP and 49 ± 12 % vs. 61 ± 6 % for China, respectively, Wilcoxon test,
<italic>p</italic>
=2.2
<italic>E</italic>
−16). Presumably, the presence of outlier pairs could be caused by
<italic>k-</italic>
mers from certain dominant sequences that are present in the gene reference catalog but not in the genome catalog. We investigated these outliers in detail.</p>
</sec>
<sec id="Sec17">
<title>Investigation of outlier samples</title>
<p>The total human gut metagenome is a phylogenetically diverse structure that includes not only the sequences of bacterial genomes but also those of the bacterial mobilome (phages, plasmids, etc.), fungi, protozoa, and traces of DNA of dietary and host origin. Our reference genome catalog partly accounts for such non-bacterial components by including the genomes of several common intestinal eukaryotes: clinically relevant yeasts
<italic>Candida</italic>
(3 genomes) and protozoan
<italic>Blastocystis</italic>
(1 genome; see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details). However, many sequences are not present in our genome catalog, particularly viral genomes. Therefore, in our analysis the potential reads of viral origin would not contribute to the taxonomic difference but would change the
<italic>k-</italic>
mer spectrum. Recently, a new bacteriophage, crAssphage, was discovered and shown to be the single major dominant member of the human gut virome [
<xref ref-type="bibr" rid="CR34">34</xref>
]. Moreover, its abundance has been estimated for the HMP metagenomes analyzed in our work: the crAssphage genome accounts for up to 20 % of the reads in this group. Clearly, such a prevalent genome should have a significant influence on
<italic>k-</italic>
mer spectra and thus on our comparison of the beta-diversity measures.</p>
<p>Based on the available data on the abundance of crAssphage in HMP samples (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
for details), the cohort was split into two groups - with high phage abundance (
<italic>n</italic>
= 5, 5−20
<italic>%</italic>
of crAssphage reads) and with low phage abundance (
<italic>n</italic>
= 124, &lt;5
<italic>%</italic>
of crAssphage reads). The whole group of extreme outliers was found to consist of the pairs where at least one of the samples belonged to the phage-enriched group (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
<xref rid="Fig3" ref-type="fig">a</xref>
, chi-square test:
<italic>X</italic>
<sup>2</sup>
=802.97,
<italic>p</italic>
=2.2
<italic>E</italic>
−16). Moreover, the outliers in Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
were also found to be the samples with high fraction of crAssphage reads (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
: Table S8).
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Analysis of outlier samples.
<bold>a</bold>
Distribution of pairwise dissimilarity obtained using
<italic>k-</italic>
mer and taxonomic composition for the HMP cohort. Different colors indicate groups of dissimilarities for: all HMP pairs; outlier pairs, where at least one of the samples belonged to the phage-enriched group; CP-filtered pairs, i.e. extreme outliers (all pairs with
<italic>k-</italic>
mer distance > 0.5) after removal of
<italic>k-</italic>
mers from reads mapped to crAssphage (CP) genome;
<bold>b</bold>
Composition of sample SRS062427 according to the combined results from two analyses (mapping to genome catalog and DIAMOND + MEGAN)</p>
</caption>
<graphic xlink:href="12859_2015_875_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p>As the extreme outliers were found to be generated by pairs including at least one of the two samples (SRS062427 and SRS014287, see Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
<xref rid="Fig1" ref-type="fig">a</xref>
), these samples were analyzed in detail. The reads that did not map to the genome catalog (86 % and 88 % of the total read number, respectively) were subjected to metagenomic classification with an alternative method based on DIAMOND alignment and the MEGAN classifier (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
). As a result, an additional 35 % and 29 % of the reads were identified as crAssphage (Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
<xref rid="Fig3" ref-type="fig">b</xref>
). To further confirm the contribution of the high phage fraction to the formation of outliers, we subtracted the
<italic>k-</italic>
mers of the crAssphage reads from
<italic>k-</italic>
mer spectra of the samples. Indeed, this operation significantly decreased the
<italic>k-</italic>
mer-based dissimilarity for the respective pairs (0.57 ± 0.08 to 0.53 ± 0.07, one-tailed Wilcoxon test,
<italic>p</italic>
&lt;2.2
<italic>E</italic>
−16, Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
<xref rid="Fig3" ref-type="fig">a</xref>
).</p>
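<p>The subtraction step described above can be sketched as follows: count the <italic>k</italic>-mers of the reads assigned to the phage, remove those counts from the sample's raw counts, and renormalize before recomputing the dissimilarity. This is an illustrative sketch only; the function names are hypothetical, and the assumption that subtraction is performed on raw counts before normalization is ours, since the exact procedure is not given in this section.</p>
<preformat>
```python
from collections import Counter


def count_kmers(reads, k):
    """Raw k-mer counts for an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def subtract_and_normalize(sample_counts, phage_counts):
    """Remove phage-derived k-mer counts, then renormalize to frequencies.

    Counter subtraction keeps only strictly positive counts, so k-mers
    contributed solely by phage reads disappear from the spectrum.
    """
    remaining = sample_counts - phage_counts
    total = sum(remaining.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in remaining.items()}
```
</preformat>
<p>Because the phage read set is a subset of the sample's reads, every subtracted count is bounded by the corresponding sample count, so no <italic>k</italic>-mer shared with other genomes is removed beyond the phage contribution.</p>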
</sec>
</sec>
</sec>
<sec id="Sec18" sec-type="discussion">
<title>Discussion</title>
<p>Here we have developed an algorithm for assessing pairwise dissimilarity of “shotgun” metagenomes based on the
<italic>k-</italic>
mer spectrum and compared it with commonly used reference-based approaches. The comparison was performed using various measures (Bray-Curtis dissimilarity and whole-genome adaptation of UniFrac) on a set of simulated metagenomes as well as on real metagenomes from two large-scale human gut microbiota studies.</p>
<p>For simulated metagenomes, we showed that
<italic>k</italic>
= 11 is an optimal value in terms of balancing between the resolution of the method and computational time. This value of
<italic>k</italic>
performed well for both high- and low-diversity simulated metagenomes; however, for low-diversity simulations the dissimilarity matrices based on
<italic>k-</italic>
mer method and taxonomic composition were less correlated (Spearman correlation
<italic>r</italic>
=0.94 and
<italic>r</italic>
= 0.87 for high- and low-diversity, respectively). This fact was likely due to the decreased diversity of
<italic>k-</italic>
mers and thus reduced differentiating resolution. For real gut metagenomes with complex community structure, the
<italic>k-</italic>
mer approach makes it possible to delineate samples with a wide range of functional composition, as demonstrated on two international cohorts (HMP and Chinese population). On the other hand, we observed that
<italic>k-</italic>
mers are less correlated with taxonomic composition than with functional (gene-based) one. We speculate that this difference could be associated with significant subspecies-level genomic diversity of gut microbes: a recent analysis of publicly available metagenomic data showed that the average gene variation between individuals across 11 abundant species was as high as 13 ± 4.5 % [
<xref ref-type="bibr" rid="CR40">40</xref>
]. The
<italic>k-</italic>
mer frequencies as well as the gene relative abundance features are sensitive to gene content variation, while in the case of species relative abundance features it would be ignored.</p>
<p>Besides gene presence/absence, another common form of genomic variability is SNPs. We attempted to model their influence on
<italic>k-</italic>
mer beta-diversity. Theoretically, the introduction of SNPs would change the frequencies of
<italic>k-</italic>
mers and thus weaken the correlation between
<italic>k-</italic>
mer and reference-based dissimilarity. In our simulations, when 1 % SNPs were introduced to simulated datasets (according to an estimate for gut bacteria [
<xref ref-type="bibr" rid="CR39">39</xref>
]), the correlation between the methods dropped slightly (from
<italic>r</italic>
=0.938 to
<italic>r</italic>
=0.935), independent of whether the evolutionary character of SNP accumulation was considered during modeling or not. However, for real metagenomes the correlation between the methods was lower (HMP:
<italic>r</italic>
=0.73, China:
<italic>r</italic>
=0.76). These results suggest the existence of other real-life effects with a stronger influence on the correlation than SNPs (including other types of genetic polymorphisms such as indels, as well as technical factors).</p>
<p>A major advantage of the
<italic>k-</italic>
mer approach is that it exploits the totality of the reads, unlike reference-based methods, which inherently discard the reads that fail to map to the reference catalog and thus the information contained within them. This feature supports the application of the
<italic>k-</italic>
mer approach as a tool for assessing the representativeness of the reference set for given metagenomes. Currently, representative sets of reference genomes are available for the microbial communities of only a few environments (e.g. human gut). However, recent discoveries imply that the so-called reference genomes do not capture the wide intra-species variation even for this extensively examined community [
<xref ref-type="bibr" rid="CR41">41</xref>
,
<xref ref-type="bibr" rid="CR42">42</xref>
]. The situation is even worse for less studied environments, such as marine ecosystems [
<xref ref-type="bibr" rid="CR43">43</xref>
] or human skin [
<xref ref-type="bibr" rid="CR44">44</xref>
]: reference catalogs for their microbiota are considerably less complete, thus rendering beta-diversity assessment difficult.</p>
<p>We propose to assess the representativeness of a reference genome catalog by examining the
<italic>k-</italic>
mer content of the metagenomic reads mapped to it. On the analyzed gut metagenomic data, we observed that
<italic>k-</italic>
mer spectra of the mapped reads produced dissimilarity profiles that had higher correlation with those obtained with taxonomic composition than the
<italic>k-</italic>
mer spectra of the whole set of reads. However, lower correlation between the two methods observed for some pairs of samples suggested the presence of dominant genomic sequences not included in the reference catalog. Detailed analysis showed that these outliers corresponded to the HMP samples enriched in crAssphage, a recently discovered gut bacteriophage [
<xref ref-type="bibr" rid="CR34">34</xref>
]; the genome of this phage was not included in the respective reference catalog.</p>
<p>Subtraction of the crAssphage
<italic>k-</italic>
mers moved the outliers towards the main cloud of points but not into it completely (BC kmer difference decreased by 90 ± 10 %). Presumably, such incomplete compensation can be linked to the high level of genomic variability inherent to gut phages [
<xref ref-type="bibr" rid="CR45">45</xref>
]: originally the consensus crAssphage genome was obtained by combined assembly of 12 metagenomes from individuals not included in our groups [
<xref ref-type="bibr" rid="CR46">46</xref>
], so its sequence in our samples might differ considerably from the consensus, and not all the crAssphage-related
<italic>k-</italic>
mers may have been removed in our analysis. Additionally, over 50 % of the reads remained unidentified by two different methods (mapping to the reference genome catalog and the DIAMOND+MEGAN-based pipeline); they may correspond to genome(s) contributing to the formation of outliers.</p>
<p>Considering the gene catalog, dedicated analysis of the reads mapped to reference genes did not shift the outliers (BC kmer difference slightly increased by 16 ± 9 %). First, a likely reason for this is that the crAssphage sequences were included in the catalog: search for crAssphage genes in the reference gene catalog (see
<xref rid="Sec2" ref-type="sec">Methods</xref>
) identified highly similar hits for 70 of the 80 phage genes (182 catalog genes in total) that were detected in at least one metagenome. Second, the gut microbial gene catalog was originally constructed from the contigs assembled from total DNA reads [
<xref ref-type="bibr" rid="CR49">49</xref>
] and is known to contain not only the bacterial genes, but viral and eukaryotic, too.</p>
<p>Interestingly, our results also imply that the Chinese cohort lacks metagenomes with such a high prevalence of this phage, prompting speculation about worldwide phage phylogeography. While no clinical associations for crAssphage have been described to date, omission of phage components could be a significant oversight in biomedical studies of microbiota. There is a growing understanding that gut phages play an important role in the ecology of the “phage-gut microbiota-human” system and include potential biomarkers; they are able to transfer clinically important bacterial genes, e.g. antibiotic resistance and pathogenicity determinants [
<xref ref-type="bibr" rid="CR47">47</xref>
]. Application of our reference-free
<italic>k-</italic>
mer approach can facilitate early detection of such sequences in biomedical diagnostics data and discovery of novel biomarkers.</p>
<p>Our approach is not only applicable to metagenomes from an arbitrary environment, but is indispensable for dissimilarity and cluster analysis of communities with poorly described components. The approach allows detection of a major presence of an unknown organism and/or virus in a metagenome. We suggest that the approach should be introduced as a necessary method of “shotgun” metagenome composition analysis, complementary to reference mapping, in order to avoid biases associated with an unrepresentative reference database.</p>
<p>Although we did not find evidence for outliers caused by technical issues in the examined datasets, the approach can also be used for primary detection of metagenomes with abnormal composition caused by high abundance of host DNA (e.g. in case of inflammatory process or specific to biopsy material), DNA of dietary origin (undigested food) and technical artifacts (e.g. dominance of sequencing adapters).</p>
<p>Finally, comparison of metagenomes based on the
<italic>k-</italic>
mer spectrum provides more information than mapping to reference sequence catalogs. Essentially,
<italic>k-</italic>
mer analysis is a feature extraction procedure applied to metagenomic reads. The produced set of features (
<italic>k-</italic>
mer spectrum) is several orders of magnitude larger than one yielded in reference-based approaches. Therefore, it provides higher discriminative resolution that opens a promising opportunity for developing a new generation of methods for metagenomic analysis, and our method makes a step towards understanding of how to explore such high-dimensional feature space efficiently.</p>
</sec>
<sec id="Sec19" sec-type="conclusion">
<title>Conclusions</title>
<p>Analysis of
<italic>k-</italic>
mer spectra for both simulated and real “shotgun” metagenomes showed that this method allows quick assessment of the pairwise dissimilarity of such datasets. Simulations show that the method is robust to variability introduced by sequencing errors and genomic mutations. The obtained dissimilarity matrix can be used not only for cluster analysis and classification purposes, but also for early detection of major unknown components and quality control of reference-based approaches. It is recommended that the method should be included as a complementary step in high-throughput computational pipelines for metagenomic data analysis.</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec20">
<title>Additional files</title>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM1_ESM.xls" id="MOESM1">
<label>Additional file 1</label>
<caption>
<p>
<bold>Supplementary tables.</bold>
Supplementary
<bold>Table S1.</bold>
Bacterial abundances in Simulation 1 (high-diversity communities). Supplementary
<bold>Table S2.</bold>
Bacterial abundances in Simulation 2 (low-diversity communities). Supplementary
<bold>Table S3.</bold>
List of genomes in taxonomic catalog for human gut. Supplementary
<bold>Table S4.</bold>
Taxonomic composition for real dataset (organism level). Supplementary
<bold>Table S5.</bold>
Taxonomic composition for real dataset (genus level). Supplementary
<bold>Table S6.</bold>
Functional composition for real dataset (COG). Supplementary
<bold>Table S7.</bold>
Taxonomic composition for real dataset by MetaPhlAn (organism level). Supplementary
<bold>Table S8.</bold>
Mapped read counts and percentage of mapping on taxonomic and functional catalog and phage genome. (XLS 2693 kb)</p>
</caption>
</media>
</p>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM2_ESM.png" id="MOESM2">
<label>Additional file 2</label>
<caption>
<p>
<bold>Figure S5.</bold>
Correlation between
<italic>k-</italic>
mer-based dissimilarity matrix obtained using the original full metagenomic readsets and their subsampled versions. The same amount of reads was repeatedly sampled from 20 randomly selected metagenomes (total amount ranging from 30 to 50 mln reads). (PNG 69 kb)</p>
</caption>
</media>
</p>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM3_ESM.png" id="MOESM3">
<label>Additional file 3</label>
<caption>
<p>
<bold>Figure S1.</bold>
Comparison of pairwise difference measures obtained by
<italic>k-</italic>
mer and taxonomic composition (50 simulated metagenomes). For each pair Y-axis represents
<italic>k-</italic>
mer dissimilarity, X-axis represents dissimilarity by taxonomic composition.
<italic>k</italic>
=10 A) using the original bacterial genomes; B) using the bacterial genomes with 1 % of SNPs. (PNG 451 kb)</p>
</caption>
</media>
</p>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM4_ESM.png" id="MOESM4">
<label>Additional file 4</label>
<caption>
<p>
<bold>Figure S2.</bold>
Comparison of pairwise dissimilarity measures obtained by
<italic>k-</italic>
mer and taxonomic composition for simulated high- and low-diversity metagenomes. As seen, a satisfactory correlation of
<italic>k-</italic>
mers with taxonomic composition can be obtained only at relatively high values of
<italic>k</italic>
. (PNG 233 kb)</p>
</caption>
</media>
</p>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM5_ESM.png" id="MOESM5">
<label>Additional file 5</label>
<caption>
<p>
<bold>Figure S3.</bold>
Correlation between
<italic>k-</italic>
mer and taxonomic composition dissimilarity matrices, as well as
<italic>k-</italic>
mer dissimilarity matrix computation time with varying values of
<italic>k</italic>
. All computations were performed on a compute node with an Opteron 6176 2.3 GHz CPU (24 cores) and 64 GB of RAM. (PNG 185 kb)</p>
</caption>
</media>
</p>
<p>
<media position="anchor" xlink:href="12859_2015_875_MOESM6_ESM.png" id="MOESM6">
<label>Additional file 6</label>
<caption>
<p>
<bold>Figure S4.</bold>
Comparison of dissimilarity measures obtained by
<italic>k-</italic>
mer and 3 reference-based methods: BC MetaPhlAn genus, BC MetaPhlAn org and WG UniFrac. For each plot, Y-axis represents
<italic>k-</italic>
mer dissimilarity, X-axis - dissimilarity using one of reference-based methods. (PNG 925 kb)</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>16S rRNA</term>
<def>
<p>16S ribosomal ribonucleic acid</p>
</def>
</def-item>
<def-item>
<term>BC</term>
<def>
<p>Bray-Curtis dissimilarity measure</p>
</def>
</def-item>
<def-item>
<term>BC kmer</term>
<def>
<p>BC based on
<italic>k-</italic>
mer spectrum</p>
</def>
</def-item>
<def-item>
<term>BC TAX</term>
<def>
<p>BC based on taxonomic composition, obtained by mapping to catalog of genomes</p>
</def>
</def-item>
<def-item>
<term>BC MetaPhlAn</term>
<def>
<p>BC based on taxonomic composition, obtained by mapping to clade-specific marker genes</p>
</def>
</def-item>
<def-item>
<term>BC COG</term>
<def>
<p>BC based on functional composition, obtained by mapping to catalog of COG (Clusters of Orthologous Groups) genes</p>
</def>
</def-item>
<def-item>
<term>DNA</term>
<def>
<p>Deoxyribonucleic acid</p>
</def>
</def-item>
<def-item>
<term>SNP</term>
<def>
<p>Single Nucleotide Polymorphism</p>
</def>
</def-item>
<def-item>
<term>WGS</term>
<def>
<p>whole genome sequencing, “shotgun”</p>
</def>
</def-item>
<def-item>
<term>WG UniFrac</term>
<def>
<p>whole genome adaptation of UniFrac measure</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>DGA and AVT initiated the project and designed the study. VIU wrote the programs for
<italic>k-</italic>
mer counting. VBD wrote the programs for dissimilarity matrix calculation, generated the simulation datasets and analyzed the data. DSI performed statistical modeling of the effect of SNPs. DGA and AVT guided the comparison of dissimilarity measures obtained by the
<italic>k-</italic>
mer and reference-based methods. VBD and AVT wrote the manuscript under the supervision of DGA. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors would like to thank Will Trimble, who provided valuable comments. AVT and DSI were supported by the Russian Scientific Foundation (project
<italic>#</italic>
15-14-00066), VBD and DGA by the Ministry of Education and Science of the Russian Federation (RFMEFI57514X0075). VIU was financially supported by the Government of the Russian Federation, Grant 074-U01. We thank Alexander Manolov, Boris Kovarsky and Mikhail Shevtsov for assistance with the data analysis and code.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dick</surname>
<given-names>GJ</given-names>
</name>
<name>
<surname>Andersson</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Simmons</surname>
<given-names>SL</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>BC</given-names>
</name>
<name>
<surname>Yelton</surname>
<given-names>AP</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Community-wide analysis of microbial genome sequence signatures</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>8</issue>
<fpage>85</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-8-r85</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>EJ</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Abell</surname>
<given-names>GCJ</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Roh</surname>
<given-names>SW</given-names>
</name>
<name>
<surname>Bae</surname>
<given-names>JW</given-names>
</name>
</person-group>
<article-title>Metagenomic Analysis of the Viral Communities in Fermented Foods</article-title>
<source>Appl Environ Microbiol</source>
<year>2010</year>
<volume>77</volume>
<issue>4</issue>
<fpage>1284</fpage>
<lpage>91</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.01859-10</pub-id>
<pub-id pub-id-type="pmid">21183634</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Singh</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Gautam</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Verma</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Metagenomics in animal gastrointestinal ecosystem: Potential biotechnological prospects</article-title>
<source>Anaerobe</source>
<year>2008</year>
<volume>14</volume>
<issue>3</issue>
<fpage>138</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="doi">10.1016/j.anaerobe.2008.03.002</pub-id>
<pub-id pub-id-type="pmid">18457965</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morgan</surname>
<given-names>XC</given-names>
</name>
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Biodiversity and functional genomics in the human microbiome</article-title>
<source>Trends in genetics: TIG</source>
<year>2013</year>
<volume>29</volume>
<issue>1</issue>
<fpage>51</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1016/j.tig.2012.09.005</pub-id>
<pub-id pub-id-type="pmid">23140990</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Riesenfeld</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Schloss</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Handelsman</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Metagenomics: Genomic Analysis of Microbial Communities</article-title>
<source>Annu Rev Genet</source>
<year>2004</year>
<volume>38</volume>
<fpage>525</fpage>
<lpage>552</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.genet.38.072902.091216</pub-id>
<pub-id pub-id-type="pmid">15568985</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lozupone</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lladser</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Knights</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stombaugh</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>UniFrac: an effective distance metric for microbial community comparison</article-title>
<source>ISME J</source>
<year>2011</year>
<volume>5</volume>
<issue>2</issue>
<fpage>169</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1038/ismej.2010.133</pub-id>
<pub-id pub-id-type="pmid">20827291</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Glöckner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Current opportunities and challenges in microbial metagenome analysis–a bioinformatic perspective</article-title>
<source>Brief Bioinform</source>
<year>2012</year>
<volume>13</volume>
<issue>6</issue>
<fpage>728</fpage>
<lpage>42</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbs039</pub-id>
<pub-id pub-id-type="pmid">22966151</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>HC-M</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>FY-L</given-names>
</name>
</person-group>
<article-title>Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<issue>Suppl 2</issue>
<fpage>5</fpage>
<pub-id pub-id-type="pmid">20047655</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Plaza Onate</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Batto</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Juste</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fadlallah</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fougeroux</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gouas</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Quality control of microbiota metagenomics by k-mer analysis</article-title>
<source>BMC Genomics</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>183</fpage>
<pub-id pub-id-type="doi">10.1186/s12864-015-1406-7</pub-id>
<pub-id pub-id-type="pmid">25887914</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Olman</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Barcodes for genomes and applications</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>1</issue>
<fpage>546</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-546</pub-id>
<pub-id pub-id-type="pmid">19091119</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Meinersmann</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Blaser</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<issue>2</issue>
<fpage>145</fpage>
<lpage>58</lpage>
<pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
<pub-id pub-id-type="pmid">12566393</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alsop</surname>
<given-names>EB</given-names>
</name>
<name>
<surname>Raymond</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Resolving prokaryotic taxonomy without rRNA: longer oligonucleotide word lengths improve genome and metagenome taxonomic classification</article-title>
<source>PloS One</source>
<year>2013</year>
<volume>8</volume>
<issue>7</issue>
<fpage>67337</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0067337</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Alignment-free supervised classification of metagenomes by recursive SVM</article-title>
<source>BMC Genomics</source>
<year>2013</year>
<volume>14</volume>
<issue>1</issue>
<fpage>641</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-14-641</pub-id>
<pub-id pub-id-type="pmid">24053649</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Silva</surname>
<given-names>GGZ</given-names>
</name>
<name>
<surname>Cuevas</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Dutilh</surname>
<given-names>BE</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares</article-title>
<source>PeerJ</source>
<year>2014</year>
<volume>2</volume>
<fpage>425</fpage>
<pub-id pub-id-type="doi">10.7717/peerj.425</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langenkämper</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Goesmann</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nattkemper</surname>
<given-names>TW</given-names>
</name>
</person-group>
<article-title>AKE - the Accelerated k-mer Exploration web-tool for rapid taxonomic classification and visualization</article-title>
<source>BMC Bioinformatics</source>
<year>2014</year>
<volume>15</volume>
<issue>1</issue>
<fpage>384</fpage>
<pub-id pub-id-type="doi">10.1186/s12859-014-0384-0</pub-id>
<pub-id pub-id-type="pmid">25495116</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liao</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting</article-title>
<source>IEEE/ACM Trans Comput Biol Bioinformatics</source>
<year>2014</year>
<volume>11</volume>
<issue>1</issue>
<fpage>42</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2013.137</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seth</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Välimäki</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kaski</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Honkela</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Exploration and retrieval of whole-metagenome sequencing samples</article-title>
<source>Bioinformatics (Oxford, England)</source>
<year>2014</year>
<volume>30</volume>
<issue>17</issue>
<fpage>2471</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu340</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ames</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Hysom</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Gardner</surname>
<given-names>SN</given-names>
</name>
<name>
<surname>Lloyd</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Gokhale</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>JE</given-names>
</name>
</person-group>
<article-title>Scalable metagenomic taxonomy classification using a reference genome database</article-title>
<source>Bioinformatics (Oxford, England)</source>
<year>2013</year>
<volume>29</volume>
<issue>18</issue>
<fpage>2253</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt389</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>YW</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>A novel abundance-based algorithm for binning metagenomic sequences using l-tuples</article-title>
<source>J Comput Biol</source>
<year>2011</year>
<volume>18</volume>
<issue>3</issue>
<fpage>523</fpage>
<lpage>34</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2010.0245</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Comparison of metagenomic samples using sequence signatures</article-title>
<source>BMC Genomics</source>
<year>2012</year>
<volume>13</volume>
<issue>1</issue>
<fpage>730</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-13-730</pub-id>
<pub-id pub-id-type="pmid">23268604</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Comparison of metatranscriptomic samples based on k-tuple frequencies</article-title>
<source>PloS One</source>
<year>2014</year>
<volume>9</volume>
<issue>1</issue>
<fpage>84348</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0084348</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison–a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>23</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marçais</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Kingsford</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>A fast, lock-free approach for efficient parallel counting of occurrences of k-mers</article-title>
<source>Bioinformatics (Oxford, England)</source>
<year>2011</year>
<volume>27</volume>
<issue>6</issue>
<fpage>764</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr011</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Audano</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vannberg</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>KAnalyze: a fast versatile pipelined k-mer toolkit</article-title>
<source>Bioinformatics (Oxford, England)</source>
<year>2014</year>
<volume>30</volume>
<issue>14</issue>
<fpage>2070</fpage>
<lpage>2</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu152</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bäckhed</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ley</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Sonnenburg</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>JI</given-names>
</name>
</person-group>
<article-title>Host-bacterial mutualism in the human intestine</article-title>
<source>Science (New York, N.Y.)</source>
<year>2005</year>
<volume>307</volume>
<issue>5717</issue>
<fpage>1915</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1126/science.1104816</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Richter</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Ott</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>MetaSim: a sequencing simulator for genomics and metagenomics</article-title>
<source>PloS One</source>
<year>2008</year>
<volume>3</volume>
<issue>10</issue>
<fpage>3373</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0003373</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27</label>
<mixed-citation publication-type="other">Structure, function and diversity of the healthy human microbiome. Nature. 2012; 486(7402):207–14. doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/nature11234">10.1038/nature11234</ext-link>
.</mixed-citation>
</ref>
<ref id="CR28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A metagenome-wide association study of gut microbiota in type 2 diabetes</article-title>
<source>Nature</source>
<year>2012</year>
<volume>490</volume>
<issue>7418</issue>
<fpage>55</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nature11450</pub-id>
<pub-id pub-id-type="pmid">23023125</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pearson</surname>
<given-names>WR</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Comparison of DNA sequences with protein sequences</article-title>
<source>Genomics</source>
<year>1997</year>
<volume>46</volume>
<issue>1</issue>
<fpage>24</fpage>
<lpage>36</lpage>
<pub-id pub-id-type="doi">10.1006/geno.1997.4995</pub-id>
<pub-id pub-id-type="pmid">9403055</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30</label>
<mixed-citation publication-type="other">Hansen MA, Oey H, Fernandez-Valverde S, Jung CH, Mattick JS. Biopieces: A Bioinformatics Toolset and Framework.
<ext-link ext-link-type="uri" xlink:href="http://www.biopieces.org">http://www.biopieces.org</ext-link>
.</mixed-citation>
</ref>
<ref id="CR31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tyakht</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Kostryukova</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Popenko</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Belenikin</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Pavlenko</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Larin</surname>
<given-names>AK</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Human gut microbiota community structures in urban and rural populations in Russia</article-title>
<source>Nat Commun</source>
<year>2013</year>
<volume>4</volume>
<fpage>2469</fpage>
<pub-id pub-id-type="doi">10.1038/ncomms3469</pub-id>
<pub-id pub-id-type="pmid">24036685</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
</person-group>
<article-title>The COG database: a tool for genome-scale analysis of protein functions and evolution</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<issue>1</issue>
<fpage>33</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1093/nar/28.1.33</pub-id>
<pub-id pub-id-type="pmid">10592175</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Fast gapped-read alignment with Bowtie 2</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>4</issue>
<fpage>357</fpage>
<lpage>9</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
<pub-id pub-id-type="pmid">22388286</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34</label>
<mixed-citation publication-type="other">Dutilh BE, Cassman N, McNair K, Sanchez SE, Silva GGZ, Boling L, et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun. 2014;5. doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ncomms5498">10.1038/ncomms5498</ext-link>
.</mixed-citation>
</ref>
<ref id="CR35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchfink</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>Fast and sensitive protein alignment using DIAMOND</article-title>
<source>Nat Methods</source>
<year>2014</year>
<volume>12</volume>
<issue>1</issue>
<fpage>59</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.3176</pub-id>
<pub-id pub-id-type="pmid">25402007</pub-id>
</element-citation>
</ref>
<ref id="CR36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res</source>
<year>2007</year>
<volume>17</volume>
<issue>3</issue>
<fpage>377</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chor</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Horn</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Massingham</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Genomic DNA k-mer spectra: models and modalities</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>10</issue>
<fpage>108</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-10-r108</pub-id>
<pub-id pub-id-type="pmid">19591645</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scholz</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Lo</surname>
<given-names>CC</given-names>
</name>
<name>
<surname>Chain</surname>
<given-names>PS</given-names>
</name>
</person-group>
<article-title>Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis</article-title>
<source>Curr Opinion Biotechnol</source>
<year>2012</year>
<volume>23</volume>
<issue>1</issue>
<fpage>9</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="doi">10.1016/j.copbio.2011.11.013</pub-id>
</element-citation>
</ref>
<ref id="CR39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schloissnig</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Arumugam</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sunagawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mitreva</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tap</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Genomic variation landscape of the human gut microbiome</article-title>
<source>Nature</source>
<year>2013</year>
<volume>493</volume>
<issue>7430</issue>
<fpage>45</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="doi">10.1038/nature11711</pub-id>
<pub-id pub-id-type="pmid">23222524</pub-id>
</element-citation>
</ref>
<ref id="CR40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Sunagawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mende</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Inter-individual differences in the gene content of human gut bacterial species</article-title>
<source>Genome Biol</source>
<year>2015</year>
<volume>16</volume>
<issue>1</issue>
<fpage>82</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-015-0646-9</pub-id>
<pub-id pub-id-type="pmid">25896518</pub-id>
</element-citation>
</ref>
<ref id="CR41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Greenblum</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Carr</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Borenstein</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Extensive Strain-Level Copy-Number Variation across Human Gut Microbiome Species</article-title>
<source>Cell</source>
<year>2015</year>
<volume>160</volume>
<issue>4</issue>
<fpage>583</fpage>
<lpage>94</lpage>
<pub-id pub-id-type="doi">10.1016/j.cell.2014.12.038</pub-id>
<pub-id pub-id-type="pmid">25640238</pub-id>
</element-citation>
</ref>
<ref id="CR42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nielsen</surname>
<given-names>HBR</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Juncker</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Rasmussen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sunagawa</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes</article-title>
<source>Nat Biotechnol</source>
<year>2014</year>
<volume>32</volume>
<issue>8</issue>
<fpage>822</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2939</pub-id>
<pub-id pub-id-type="pmid">24997787</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sunagawa</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Coelho</surname>
<given-names>LP</given-names>
</name>
<name>
<surname>Chaffron</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kultima</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Labadie</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Salazar</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ocean plankton. Structure and function of the global ocean microbiome</article-title>
<source>Science</source>
<year>2015</year>
<volume>348</volume>
<issue>6237</issue>
<fpage>1261359</fpage>
<pub-id pub-id-type="doi">10.1126/science.1261359</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leung</surname>
<given-names>MHY</given-names>
</name>
<name>
<surname>Wilkins</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>PKH</given-names>
</name>
</person-group>
<article-title>Insights into the pan-microbiome: skin microbial communities of Chinese individuals differ from other racial groups</article-title>
<source>Sci Rep</source>
<year>2015</year>
<volume>5</volume>
<fpage>11845</fpage>
<pub-id pub-id-type="doi">10.1038/srep11845</pub-id>
<pub-id pub-id-type="pmid">26177982</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Minot</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Keilbaugh</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>GD</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The human gut virome: inter-individual variation and dynamic response to diet</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<issue>10</issue>
<fpage>1616</fpage>
<lpage>25</lpage>
<pub-id pub-id-type="doi">10.1101/gr.122705.111</pub-id>
<pub-id pub-id-type="pmid">21880779</pub-id>
</element-citation>
</ref>
<ref id="CR46">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reyes</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Haynes</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hanson</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Angly</surname>
<given-names>FE</given-names>
</name>
<name>
<surname>Heath</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Viruses in the faecal microbiota of monozygotic twins and their mothers</article-title>
<source>Nature</source>
<year>2010</year>
<volume>466</volume>
<issue>7304</issue>
<fpage>334</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1038/nature09199</pub-id>
<pub-id pub-id-type="pmid">20631792</pub-id>
</element-citation>
</ref>
<ref id="CR47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Modi</surname>
<given-names>SR</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>HH</given-names>
</name>
<name>
<surname>Spina</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Collins</surname>
<given-names>JJ</given-names>
</name>
</person-group>
<article-title>Antibiotic treatment expands the resistance reservoir and ecological network of the phage metagenome</article-title>
<source>Nature</source>
<year>2013</year>
<volume>499</volume>
<issue>7457</issue>
<fpage>219</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="doi">10.1038/nature12212</pub-id>
<pub-id pub-id-type="pmid">23748443</pub-id>
</element-citation>
</ref>
<ref id="CR48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Segata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Waldron</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ballarini</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Jousson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Huttenhower</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Metagenomic microbial community profiling using unique clade-specific marker genes</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>8</issue>
<fpage>811</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.2066</pub-id>
<pub-id pub-id-type="pmid">22688413</pub-id>
</element-citation>
</ref>
<ref id="CR49">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Arumugam</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Burgdorf</surname>
<given-names>KS</given-names>
</name>
<name>
<surname>Manichanh</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A human gut microbial gene catalogue established by metagenomic sequencing</article-title>
<source>Nature</source>
<year>2010</year>
<volume>464</volume>
<issue>7285</issue>
<fpage>59</fpage>
<lpage>65</lpage>
<pub-id pub-id-type="doi">10.1038/nature08821</pub-id>
<pub-id pub-id-type="pmid">20203603</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000261 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000261 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4715287
   |texte=   Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26774270" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021