MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 0010149 (Pmc/Corpus); previous: 0010148; next: 0010150



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</title>
<author>
<name sortKey="Koslicki, David" sort="Koslicki, David" uniqKey="Koslicki D" first="David" last="Koslicki">David Koslicki</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chatterjee, Saikat" sort="Chatterjee, Saikat" uniqKey="Chatterjee S" first="Saikat" last="Chatterjee">Saikat Chatterjee</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shahrivar, Damon" sort="Shahrivar, Damon" uniqKey="Shahrivar D" first="Damon" last="Shahrivar">Damon Shahrivar</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Walker, Alan W" sort="Walker, Alan W" uniqKey="Walker A" first="Alan W." last="Walker">Alan W. Walker</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Francis, Suzanna C" sort="Francis, Suzanna C" uniqKey="Francis S" first="Suzanna C." last="Francis">Suzanna C. Francis</name>
<affiliation>
<nlm:aff id="aff004">
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fraser, Louise J" sort="Fraser, Louise J" uniqKey="Fraser L" first="Louise J." last="Fraser">Louise J. Fraser</name>
<affiliation>
<nlm:aff id="aff005">
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vehkaper, Mikko" sort="Vehkaper, Mikko" uniqKey="Vehkaper M" first="Mikko" last="Vehkaper">Mikko Vehkaperä</name>
<affiliation>
<nlm:aff id="aff006">
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lan, Yueheng" sort="Lan, Yueheng" uniqKey="Lan Y" first="Yueheng" last="Lan">Yueheng Lan</name>
<affiliation>
<nlm:aff id="aff007">
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation>
<nlm:aff id="aff008">
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26496191</idno>
<idno type="pmc">4619776</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619776</idno>
<idno type="RBID">PMC:4619776</idno>
<idno type="doi">10.1371/journal.pone.0140644</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">001014</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001014</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</title>
<author>
<name sortKey="Koslicki, David" sort="Koslicki, David" uniqKey="Koslicki D" first="David" last="Koslicki">David Koslicki</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Chatterjee, Saikat" sort="Chatterjee, Saikat" uniqKey="Chatterjee S" first="Saikat" last="Chatterjee">Saikat Chatterjee</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shahrivar, Damon" sort="Shahrivar, Damon" uniqKey="Shahrivar D" first="Damon" last="Shahrivar">Damon Shahrivar</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Walker, Alan W" sort="Walker, Alan W" uniqKey="Walker A" first="Alan W." last="Walker">Alan W. Walker</name>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Francis, Suzanna C" sort="Francis, Suzanna C" uniqKey="Francis S" first="Suzanna C." last="Francis">Suzanna C. Francis</name>
<affiliation>
<nlm:aff id="aff004">
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fraser, Louise J" sort="Fraser, Louise J" uniqKey="Fraser L" first="Louise J." last="Fraser">Louise J. Fraser</name>
<affiliation>
<nlm:aff id="aff005">
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Vehkaper, Mikko" sort="Vehkaper, Mikko" uniqKey="Vehkaper M" first="Mikko" last="Vehkaper">Mikko Vehkaperä</name>
<affiliation>
<nlm:aff id="aff006">
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Lan, Yueheng" sort="Lan, Yueheng" uniqKey="Lan Y" first="Yueheng" last="Lan">Yueheng Lan</name>
<affiliation>
<nlm:aff id="aff007">
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation>
<nlm:aff id="aff008">
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec id="sec001">
<title>Motivation</title>
<p>Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.</p>
</sec>
<sec id="sec002">
<title>Results</title>
<p>There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to estimate bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order
<italic>k</italic>
-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The
<italic>aggregation of reads</italic>
is a
<italic>pre-processing</italic>
approach in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of
<italic>k</italic>
-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.</p>
</sec>
<sec id="sec003">
<title>Availability</title>
<p>An open source, platform-independent implementation of the method in the Julia programming language is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
. A Matlab implementation is available at
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Garrity, Gm" uniqKey="Garrity G">GM Garrity</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meinicke, P" uniqKey="Meinicke P">P Meinicke</name>
</author>
<author>
<name sortKey="A Hauer, Kp" uniqKey="A Hauer K">KP Aßhauer</name>
</author>
<author>
<name sortKey="Lingner, T" uniqKey="Lingner T">T Lingner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D Koslicki</name>
</author>
<author>
<name sortKey="Foucart, S" uniqKey="Foucart S">S Foucart</name>
</author>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ong, Sh" uniqKey="Ong S">SH Ong</name>
</author>
<author>
<name sortKey="Kukkillaya, Vu" uniqKey="Kukkillaya V">VU Kukkillaya</name>
</author>
<author>
<name sortKey="Wilm, A" uniqKey="Wilm A">A Wilm</name>
</author>
<author>
<name sortKey="Lay, C" uniqKey="Lay C">C Lay</name>
</author>
<author>
<name sortKey="Ho, Exp" uniqKey="Ho E">EXP Ho</name>
</author>
<author>
<name sortKey="Low, L" uniqKey="Low L">L Low</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Droge, J" uniqKey="Droge J">J Dröge</name>
</author>
<author>
<name sortKey="Gregor, I" uniqKey="Gregor I">I Gregor</name>
</author>
<author>
<name sortKey="Mchardy, A" uniqKey="Mchardy A">A McHardy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cai, Y" uniqKey="Cai Y">Y Cai</name>
</author>
<author>
<name sortKey="Sun, Y" uniqKey="Sun Y">Y Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cheng, L" uniqKey="Cheng L">L Cheng</name>
</author>
<author>
<name sortKey="Walker, Aw" uniqKey="Walker A">AW Walker</name>
</author>
<author>
<name sortKey="Corander, J" uniqKey="Corander J">J Corander</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
<author>
<name sortKey="Auch, Af" uniqKey="Auch A">AF Auch</name>
</author>
<author>
<name sortKey="Qi, J" uniqKey="Qi J">J Qi</name>
</author>
<author>
<name sortKey="Schuster, Sc" uniqKey="Schuster S">SC Schuster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mitra, S" uniqKey="Mitra S">S Mitra</name>
</author>
<author>
<name sortKey="St Rk, M" uniqKey="St Rk M">M Stärk</name>
</author>
<author>
<name sortKey="Huson, Dh" uniqKey="Huson D">DH Huson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Raes, J" uniqKey="Raes J">J Raes</name>
</author>
<author>
<name sortKey="Tringe, Sg" uniqKey="Tringe S">SG Tringe</name>
</author>
<author>
<name sortKey="Doerks, T" uniqKey="Doerks T">T Doerks</name>
</author>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
<author>
<name sortKey="Garbarine, E" uniqKey="Garbarine E">E Garbarine</name>
</author>
<author>
<name sortKey="Caseiro, D" uniqKey="Caseiro D">D Caseiro</name>
</author>
<author>
<name sortKey="Polikar, R" uniqKey="Polikar R">R Polikar</name>
</author>
<author>
<name sortKey="Sokhansanj, B" uniqKey="Sokhansanj B">B Sokhansanj</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
<author>
<name sortKey="Reichenberger, E" uniqKey="Reichenberger E">E Reichenberger</name>
</author>
<author>
<name sortKey="Rosenfeld, A" uniqKey="Rosenfeld A">A Rosenfeld</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chatterjee, S" uniqKey="Chatterjee S">S Chatterjee</name>
</author>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D Koslicki</name>
</author>
<author>
<name sortKey="Dong, S" uniqKey="Dong S">S Dong</name>
</author>
<author>
<name sortKey="Innocenti, N" uniqKey="Innocenti N">N Innocenti</name>
</author>
<author>
<name sortKey="Cheng, L" uniqKey="Cheng L">L Cheng</name>
</author>
<author>
<name sortKey="Lan, Y" uniqKey="Lan Y">Y Lan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Linde, Y" uniqKey="Linde Y">Y Linde</name>
</author>
<author>
<name sortKey="Buzo, A" uniqKey="Buzo A">A Buzo</name>
</author>
<author>
<name sortKey="Gray, Rm" uniqKey="Gray R">RM Gray</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chatterjee, S" uniqKey="Chatterjee S">S Chatterjee</name>
</author>
<author>
<name sortKey="Sreenivas, Tv" uniqKey="Sreenivas T">TV Sreenivas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chatterjee, S" uniqKey="Chatterjee S">S Chatterjee</name>
</author>
<author>
<name sortKey="Sreenivas, Tv" uniqKey="Sreenivas T">TV Sreenivas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ambat, Sk" uniqKey="Ambat S">SK Ambat</name>
</author>
<author>
<name sortKey="Chatterjee, S" uniqKey="Chatterjee S">S Chatterjee</name>
</author>
<author>
<name sortKey="Hari, Kvs" uniqKey="Hari K">KVS Hari</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Otu, Hh" uniqKey="Otu H">HH Otu</name>
</author>
<author>
<name sortKey="Sayood, K" uniqKey="Sayood K">K Sayood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Duda, Ro" uniqKey="Duda R">RO Duda</name>
</author>
<author>
<name sortKey="Hart, Pe" uniqKey="Hart P">PE Hart</name>
</author>
<author>
<name sortKey="Stork, Dg" uniqKey="Stork D">DG Stork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Angly, Fe" uniqKey="Angly F">FE Angly</name>
</author>
<author>
<name sortKey="Willner, D" uniqKey="Willner D">D Willner</name>
</author>
<author>
<name sortKey="Rohwer, F" uniqKey="Rohwer F">F Rohwer</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Tyson, Gw" uniqKey="Tyson G">GW Tyson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Balzer, S" uniqKey="Balzer S">S Balzer</name>
</author>
<author>
<name sortKey="Malde, K" uniqKey="Malde K">K Malde</name>
</author>
<author>
<name sortKey="Lanzen, A" uniqKey="Lanzen A">A Lanzén</name>
</author>
<author>
<name sortKey="Sharma, A" uniqKey="Sharma A">A Sharma</name>
</author>
<author>
<name sortKey="Jonassen, I" uniqKey="Jonassen I">I Jonassen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Claesson, Mj" uniqKey="Claesson M">MJ Claesson</name>
</author>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="O Ullivan, O" uniqKey="O Ullivan O">O O’Sullivan</name>
</author>
<author>
<name sortKey="Greene Diniz, R" uniqKey="Greene Diniz R">R Greene-Diniz</name>
</author>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
<author>
<name sortKey="Ross, Rp" uniqKey="Ross R">RP Ross</name>
</author>
<author>
<name sortKey="O Oole, Pw" uniqKey="O Oole P">PW O’Toole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D Koslicki</name>
</author>
<author>
<name sortKey="Foucart, S" uniqKey="Foucart S">S Foucart</name>
</author>
<author>
<name sortKey="Rosen, G" uniqKey="Rosen G">G Rosen</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26496191</article-id>
<article-id pub-id-type="pmc">4619776</article-id>
<article-id pub-id-type="publisher-id">PONE-D-15-15487</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0140644</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</article-title>
<alt-title alt-title-type="running-head">ARK</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Koslicki</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chatterjee</surname>
<given-names>Saikat</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shahrivar</surname>
<given-names>Damon</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Walker</surname>
<given-names>Alan W.</given-names>
</name>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Francis</surname>
<given-names>Suzanna C.</given-names>
</name>
<xref ref-type="aff" rid="aff004">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fraser</surname>
<given-names>Louise J.</given-names>
</name>
<xref ref-type="aff" rid="aff005">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vehkaperä</surname>
<given-names>Mikko</given-names>
</name>
<xref ref-type="aff" rid="aff006">
<sup>6</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Lan</surname>
<given-names>Yueheng</given-names>
</name>
<xref ref-type="aff" rid="aff007">
<sup>7</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Corander</surname>
<given-names>Jukka</given-names>
</name>
<xref ref-type="aff" rid="aff008">
<sup>8</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</aff>
<aff id="aff004">
<label>4</label>
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</aff>
<aff id="aff005">
<label>5</label>
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</aff>
<aff id="aff006">
<label>6</label>
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</aff>
<aff id="aff007">
<label>7</label>
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</aff>
<aff id="aff008">
<label>8</label>
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Badger</surname>
<given-names>Jonathan H.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>National Cancer Institute, UNITED STATES</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>
<bold>Competing Interests: </bold>
L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.</p>
</fn>
<fn fn-type="con" id="contrib001">
<p>Conceived and designed the experiments: SC DK DS. Performed the experiments: DK SC DS. Analyzed the data: SC DK AWW JC. Contributed reagents/materials/analysis tools: AWW SCF LJF JC. Wrote the paper: DK SC AWW MV YL JC. Led the team: SC.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>sach@kth.se</email>
</corresp>
</author-notes>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>23</day>
<month>10</month>
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>10</issue>
<elocation-id>e0140644</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>4</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>9</month>
<year>2015</year>
</date>
</history>
<permissions>
<license xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0140644.pdf"></self-uri>
<abstract>
<sec id="sec001">
<title>Motivation</title>
<p>Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.</p>
</sec>
<sec id="sec002">
<title>Results</title>
<p>There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to estimate bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order
<italic>k</italic>
-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The
<italic>aggregation of reads</italic>
is a
<italic>pre-processing</italic>
approach in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of only a single statistical summary in terms of
<italic>k</italic>
-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.</p>
</sec>
<sec id="sec003">
<title>Availability</title>
<p>An open source, platform-independent implementation of the method in the Julia programming language is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
. A Matlab implementation is available at
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
.</p>
</sec>
</abstract>
<funding-group>
<funding-statement>This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W.), and the UK MRC/DFID grant G1002369 (S.C.F.). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<fig-count count="10"></fig-count>
<table-count count="0"></table-count>
<page-count count="16"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>The data and programs are available from GitHub (
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
) or from here (
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
). The additional real biological data used in our experiments have been submitted to the European Nucleotide Archive under accession number PRJEB9828.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>The data and programs are available from GitHub (
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
) or from here (
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
). The additional real biological data used in our experiments have been submitted to the European Nucleotide Archive under accession number PRJEB9828.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec004">
<title>Introduction</title>
<p>The advent of high-throughput sequencing technologies has enabled detection of bacterial community composition at an unprecedented level of detail. One technological approach is to produce, for each sample, a large number of reads from amplicons of the 16S rRNA gene, which enables the identification and comparison of the relative frequencies of different taxonomic units present across samples. The rapidly increasing number of reads produced per sample creates a need for fast taxonomic classification of samples. This problem has attracted considerable recent attention [
<xref rid="pone.0140644.ref001" ref-type="bibr">1</xref>–
<xref rid="pone.0140644.ref005" ref-type="bibr">5</xref>
].</p>
<p>Many existing approaches to the bacterial community composition estimation problem use 16S rRNA gene amplicon sequencing, in which a large number of moderate-length reads (around 250–500 bp) are produced from each sample and then either clustered or classified to obtain a composition estimate of taxonomic units. In the clustering approach, reads are grouped into taxonomic units by either distance-based or probabilistic methods [
<xref rid="pone.0140644.ref006" ref-type="bibr">6</xref>–
<xref rid="pone.0140644.ref008" ref-type="bibr">8</xref>
], and the actual taxonomic labels are assigned to the clusters afterwards by matching their consensus sequences to a reference database. In contrast to the clustering methods, the classification approach uses a reference database directly to assign reads to meaningful biological units. Methods for the classification of reads have been based either on homology using sequence similarity, or on genomic signatures in terms of
<italic>k</italic>
-mer composition. Examples of homology-based methods include MEGAN [
<xref rid="pone.0140644.ref009" ref-type="bibr">9</xref>
,
<xref rid="pone.0140644.ref010" ref-type="bibr">10</xref>
] and phylogenetic analysis [
<xref rid="pone.0140644.ref011" ref-type="bibr">11</xref>
]. Another popular approach is to use a Bayesian classifier [
<xref rid="pone.0140644.ref001" ref-type="bibr">1</xref>
,
<xref rid="pone.0140644.ref012" ref-type="bibr">12</xref>
,
<xref rid="pone.0140644.ref013" ref-type="bibr">13</xref>
]. One such method, the Ribosomal Database Project’s (RDP) naïve Bayesian classifier (NBC) [
<xref rid="pone.0140644.ref001" ref-type="bibr">1</xref>
], assigns a label explicitly to each read produced for a particular sample. Despite the methodological simplicity of NBC, the RDP classifier may still require several days to process a large data set in a desktop environment because it classifies the data read by read. Given this challenge, considerably faster estimation methods based on mixtures of
<italic>k</italic>
-mer counts have been developed, for example, Taxy [
<xref rid="pone.0140644.ref002" ref-type="bibr">2</xref>
], Quikr [
<xref rid="pone.0140644.ref003" ref-type="bibr">3</xref>
] and the recently proposed SEK [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
]. Taxy is a convex-optimization-based method. SEK and Quikr are sparse-signal-processing-based methods (inspired by compressed sensing and convex optimization), and SEK was shown to perform better than Quikr and Taxy in [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
].</p>
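<p>As an illustration of the genomic signatures mentioned above, the following minimal Python sketch computes a normalized <italic>k</italic>-mer frequency vector for a single read and the sample mean vector over a set of reads. It is not taken from Taxy, Quikr or SEK; the function names <monospace>kmer_frequencies</monospace> and <monospace>sample_mean_vector</monospace>, the default <italic>k</italic> = 2, and the choice to skip windows containing ambiguous bases are assumptions made for this example.</p>
<preformat>
```python
from itertools import product


def kmer_frequencies(read, k=2):
    """Normalized k-mer frequency vector of one read.

    Entries follow the lexicographic order of all k-mers over ACGT.
    Skipping windows with other symbols (e.g. N) is an assumption of this sketch.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def sample_mean_vector(reads, k=2):
    """First-order statistic: the mean of the per-read k-mer frequency vectors."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    if not vectors:
        raise ValueError("reads must be non-empty")
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```
</preformat>
<p>For example, <monospace>sample_mean_vector(["AAAA", "CCCC"], 1)</monospace> averages the two per-read vectors, giving equal weight to A and C.</p>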
<p>Taxy, Quikr and SEK all use as their main input a (statistical) mean vector of sample
<italic>k</italic>
-mer counts computed from the reads obtained for a sample. The
<italic>k</italic>
-mer counts (also called
<italic>k</italic>
-mers) are feature vectors extracted from raw sequence data. The necessary modeling assumption is that the sample mean vector of
<italic>k</italic>
-mer counts (that is, the first-order statistics) is sufficiently informative about the sample composition. These three methods do not use the reads in any additional way once the mean vector of
<italic>k</italic>
-mers is computed. We propose here an alternative basis of information aggregation that remains computationally tractable to allow processing of large sets of reads. Borrowing ideas from source coding in signal processing [
<xref rid="pone.0140644.ref015" ref-type="bibr">15</xref>
,
<xref rid="pone.0140644.ref016" ref-type="bibr">16</xref>
], clustering in machine learning and source coding [
<xref rid="pone.0140644.ref017" ref-type="bibr">17</xref>
], fusion in signal estimation [
<xref rid="pone.0140644.ref018" ref-type="bibr">18</xref>
] and divide-and-conquer based shotgun sequence assembly [
<xref rid="pone.0140644.ref019" ref-type="bibr">19</xref>
], our novel approach first segregates the full set of reads into subsets (in the
<italic>k</italic>
-mer feature space), computes the mean vector for each subset, employs a standard method (such as Taxy, Quikr or SEK) to estimate the composition for each subset, and finally fuses these estimates into a joint composition estimate for all the reads. To segregate the reads into subsets, we employ the K-means clustering algorithm [
<xref rid="pone.0140644.ref020" ref-type="bibr">20</xref>
]. Since the K-means clustering algorithm is simple and computationally inexpensive for a reasonable number
<italic>Q</italic>
of clusters (subsets), it can be used to partition even fairly large sets of reads into internally more homogeneous subsets. By its algorithmic nature, K-means clustering partitions the feature space into
<italic>Q</italic>
non-overlapping regions and provides a set of corresponding mean vectors. This is called
<italic>codebook generation</italic>
in vector quantization [
<xref rid="pone.0140644.ref015" ref-type="bibr">15</xref>
], a concept originating in signal processing, coding and clustering. We term our new method Aggregation of Reads by K-means (ARK). From a statistical perspective, the theoretical justification of ARK stems from a modeling framework with a mixture of densities.</p>
</sec>
<sec sec-type="materials|methods" id="sec005">
<title>Methods</title>
<sec id="sec006">
<title>Summarizing read sequence data by single mean
<italic>k</italic>
-mer counts</title>
<p>In the method description, we denote the non-negative real line by ℝ
<sub>+</sub>
and the statistical expectation operator by 𝔼[.]. First, we describe the previously published approach of using a single <italic>k</italic>-mer summary for each sample. Let
<inline-formula id="pone.0140644.e001">
<alternatives>
<graphic xlink:href="pone.0140644.e001.jpg" id="pone.0140644.e001g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M1">
<mml:mrow>
<mml:mtext mathvariant="bold">x</mml:mtext>
<mml:mo>∈</mml:mo>
<mml:msubsup>
<mml:mo>ℝ</mml:mo>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
and 𝒞
<sub>
<italic>m</italic>
</sub>
denote a random
<italic>k</italic>
-mer feature vector and the
<italic>m</italic>
th taxonomic unit, respectively. Given a test set of
<italic>k</italic>
-mers (computed from reads), the distribution of the test set is modeled as
<disp-formula id="pone.0140644.e002">
<alternatives>
<graphic xlink:href="pone.0140644.e002.jpg" id="pone.0140644.e002g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M2">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(1)</label>
</disp-formula>
where we denote the probability of taxonomic unit
<italic>m</italic>
(or class weight) by
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
), satisfying
<inline-formula id="pone.0140644.e003">
<alternatives>
<graphic xlink:href="pone.0140644.e003.jpg" id="pone.0140644.e003g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M3">
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
. Note that
<inline-formula id="pone.0140644.e004">
<alternatives>
<graphic xlink:href="pone.0140644.e004.jpg" id="pone.0140644.e004g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M4">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
is the composition of taxonomic units in the given test set (reads). The inference task is to estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) as accurately as possible with reasonable computational resources. Let us derive the mean vector
<disp-formula id="pone.0140644.e005">
<alternatives>
<graphic xlink:href="pone.0140644.e005.jpg" id="pone.0140644.e005g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M5">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mo>∫</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>d</mml:mi>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>∫</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>d</mml:mi>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mo>∫</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>d</mml:mi>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(2)</label>
</disp-formula>
The mean 𝔼[
<bold>x</bold>
] contains information about
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) in this probabilistic formulation. In practice, the information summary is obtained by computing the sample mean from the complete set of reads available for a sample. Let us denote the sample mean of
<italic>k</italic>
-mer feature vectors of the reads by
<inline-formula id="pone.0140644.e006">
<alternatives>
<graphic xlink:href="pone.0140644.e006.jpg" id="pone.0140644.e006g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M6">
<mml:mrow>
<mml:mi mathvariant="bold-italic">μ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msubsup>
<mml:mo>ℝ</mml:mo>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
with the assumption that
<bold>
<italic>μ</italic>
</bold>
≈ 𝔼[
<bold>x</bold>
]. Several methods, such as Taxy [
<xref rid="pone.0140644.ref002" ref-type="bibr">2</xref>
], Quikr [
<xref rid="pone.0140644.ref003" ref-type="bibr">3</xref>
], and SEK [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
] use the sample mean
<bold>
<italic>μ</italic>
</bold>
directly as the main input to compute the composition
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
).</p>
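<p>As a minimal sketch of this data summary, the following Python code counts the <italic>k</italic>-mers of each read and averages the resulting vectors to obtain the sample mean. The function names kmer_counts and sample_mean, the lexicographic A, C, G, T ordering, the use of unnormalized counts and the two toy reads are illustrative assumptions; they are not taken from Taxy, Quikr or SEK.</p>

```python
from itertools import product


def kmer_counts(read, k):
    """Return the 4**k vector of k-mer counts of one read (order: A, C, G, T)."""
    # Map every possible k-mer to a fixed position in the feature vector.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0.0] * (4 ** k)
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if kmer in index:  # skip windows with ambiguous bases such as N
            counts[index[kmer]] += 1.0
    return counts


def sample_mean(vectors):
    """Component-wise mean of equally long feature vectors (the vector mu)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Two toy reads (assumed data) summarized with k = 2.
reads = ["ACGT", "AAAA"]
mu = sample_mean([kmer_counts(r, 2) for r in reads])
```

<p>Here mu has 4<sup>2</sup> = 16 entries. A method that works from the mean vector alone would receive only mu and never see the individual reads again.</p>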
</sec>
<sec id="sec007">
<title>Aggregation of reads by K-means (ARK)</title>
<p>For the above-described principle of information aggregation from the reads by the mean vector of
<italic>k</italic>
-mer counts, computation of the sample mean vector is straightforward. This enables handling of a very large number of reads at low computational cost. However, we hypothesize that the sample mean vector computed from the full set of reads does not carry enough information to allow accurate estimation of
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
). Indeed, since typically the number of training taxonomic units
<italic>M</italic>
is much larger than the number of distinct
<italic>k</italic>
-mers (for example, 4<sup>6</sup> = 4096 for
<italic>k</italic>
= 6), the set of
<italic>k</italic>
-mer vectors for
<inline-formula id="pone.0140644.e007">
<alternatives>
<graphic xlink:href="pone.0140644.e007.jpg" id="pone.0140644.e007g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M7">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
is not linearly independent, and so we risk reconstructing a mixture of taxonomic units as a single taxonomic unit. Hence, we segregate the reads into several subsets and compute a sample mean vector separately for each subset, assuming that a set of sample mean vectors is more informative than a single mean vector. Note that in the case where the resulting read subsets were not in practice distinct from each other in terms of their
<italic>k</italic>
-mer counts, the subsequent composition estimate would effectively be identical to the estimate obtained with a single data summary described in Eqs (
<xref ref-type="disp-formula" rid="pone.0140644.e002">1</xref>
) and (
<xref ref-type="disp-formula" rid="pone.0140644.e005">2</xref>
).</p>
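<p>The ambiguity caused by linear dependence can be illustrated with a toy calculation. In the sketch below, the three two-dimensional signature vectors are hypothetical stand-ins for the mean <italic>k</italic>-mer vectors of three taxonomic units, and the helper mix is an assumed name. An equal mixture of the first two units then has exactly the same mean as the third unit alone.</p>

```python
def mix(weights, signatures):
    """Mean feature vector of a mixture: sum over m of p(C_m) * E[x | C_m]."""
    dim = len(signatures[0])
    return [sum(w * s[i] for w, s in zip(weights, signatures)) for i in range(dim)]


# Hypothetical mean k-mer vectors E[x | C_m] of three taxa in a 2-dimensional
# feature space; the third vector is the average of the first two.
signatures = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

mean_two_taxa = mix([0.5, 0.5, 0.0], signatures)   # equal mix of taxa 1 and 2
mean_one_taxon = mix([0.0, 0.0, 1.0], signatures)  # taxon 3 alone
same_summary = mean_two_taxa == mean_one_taxon     # both means are [2.0, 2.0]
```

<p>In this toy setting, if the reads of the first two units fell into different subsets, the two subset means would be [4.0, 0.0] and [0.0, 4.0], and neither would coincide with the signature of the third unit.</p>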
<p>Let us partition the
<italic>k</italic>
-mer feature space
<inline-formula id="pone.0140644.e008">
<alternatives>
<graphic xlink:href="pone.0140644.e008.jpg" id="pone.0140644.e008g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M8">
<mml:mrow>
<mml:msubsup>
<mml:mo>ℝ</mml:mo>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
into
<italic>Q</italic>
non-overlapping regions 𝓡
<sub>
<italic>q</italic>
</sub>
such that
<inline-formula id="pone.0140644.e009">
<alternatives>
<graphic xlink:href="pone.0140644.e009.jpg" id="pone.0140644.e009g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M9">
<mml:mrow>
<mml:msubsup>
<mml:mo>⋃</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mo>𝓡</mml:mo>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>ℝ</mml:mo>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
and ∀
<italic>q</italic>
,
<italic>r</italic>
,
<italic>q</italic>
≠
<italic>r</italic>
, 𝓡
<sub>
<italic>q</italic>
</sub>
∩ 𝓡
<sub>
<italic>r</italic>
</sub>
= ∅. Such partitions can be formed by a standard K-means algorithm, which typically uses a nearest neighbor classification rule based on the squared Euclidean distance. The non-overlapping regions 𝓡
<sub>
<italic>q</italic>
</sub>
are called Voronoi regions. We define
<italic>P</italic>
<sub>
<italic>q</italic>
</sub>
≜ Pr(
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
) satisfying
<inline-formula id="pone.0140644.e010">
<alternatives>
<graphic xlink:href="pone.0140644.e010.jpg" id="pone.0140644.e010g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M10">
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
. In practice,
<italic>P</italic>
<sub>
<italic>q</italic>
</sub>
is computed as
<disp-formula id="pone.0140644.e011">
<alternatives>
<graphic xlink:href="pone.0140644.e011.jpg" id="pone.0140644.e011g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M11">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>number</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>of</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>feature</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>vectors</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>in</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi mathvariant="normal">q</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mtext>total</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>number</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>of</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>feature</mml:mtext>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>vectors</mml:mtext>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(3)</label>
</disp-formula>
Recall that the feature vectors are
<italic>k</italic>
-mers. The distribution of the full test set and subsets can be written as
<disp-formula id="pone.0140644.e012">
<alternatives>
<graphic xlink:href="pone.0140644.e012.jpg" id="pone.0140644.e012g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M12">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(4)</label>
</disp-formula>
where the first equation follows a standard mixture density framework. Now, if we can estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
), then the final quantity of interest
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) can be estimated as
<disp-formula id="pone.0140644.e013">
<alternatives>
<graphic xlink:href="pone.0140644.e013.jpg" id="pone.0140644.e013g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M13">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(5)</label>
</disp-formula>
The estimation of
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) in
<xref ref-type="disp-formula" rid="pone.0140644.e013">Eq (5)</xref>
is a judicious fusion of
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
) through a linear combination. Let us now derive the mean vector for 𝓡
<sub>
<italic>q</italic>
</sub>
, which is a conditional mean vector
<disp-formula id="pone.0140644.e014">
<alternatives>
<graphic xlink:href="pone.0140644.e014.jpg" id="pone.0140644.e014g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M14">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mo>[</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mo>∫</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>d</mml:mi>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd></mml:mtd>
<mml:mtd></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mo>∫</mml:mo>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">R</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mi>d</mml:mi>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
<label>(6)</label>
</disp-formula>
The mean 𝔼[
<bold>x</bold>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
] contains information about
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
). In practice we use the sample mean denoted by
<bold>
<italic>μ</italic>
</bold>
<sub>
<italic>q</italic>
</sub>
with the assumption that
<bold>
<italic>μ</italic>
</bold>
<sub>
<italic>q</italic>
</sub>
≈ 𝔼[
<bold>x</bold>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
]. Comparing Eqs (
<xref ref-type="disp-formula" rid="pone.0140644.e005">2</xref>
) and (
<xref ref-type="disp-formula" rid="pone.0140644.e014">6</xref>
), for the
<italic>q</italic>
th Voronoi region 𝓡
<sub>
<italic>q</italic>
</sub>
we can estimate composition
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
) by using an appropriate composition estimation method, such as Taxy, Quikr or SEK.</p>
</sec>
<sec id="sec008">
<title>Algorithms</title>
<p>The ARK algorithm can be implemented by the following steps.
<list list-type="order">
<list-item>
<p>Divide the full test dataset of
<italic>k</italic>
-mers into
<italic>Q</italic>
subsets. The region 𝓡
<sub>
<italic>q</italic>
</sub>
corresponds to the
<italic>q</italic>
th subset.</p>
</list-item>
<list-item>
<p>For the
<italic>q</italic>
th subset, compute
<italic>P</italic>
<sub>
<italic>q</italic>
</sub>
and the sample mean
<bold>
<italic>μ</italic>
</bold>
<sub>
<italic>q</italic>
</sub>
.</p>
</list-item>
<list-item>
<p>For the
<italic>q</italic>
th subset, apply a composition estimation method that uses the input
<bold>
<italic>μ</italic>
</bold>
<sub>
<italic>q</italic>
</sub>
; estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣
<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
).</p>
</list-item>
<list-item>
<p>Estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) by
<inline-formula id="pone.0140644.e015">
<alternatives>
<graphic xlink:href="pone.0140644.e015.jpg" id="pone.0140644.e015g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M15">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mspace width="0.167em"></mml:mspace>
<mml:mspace width="0.167em"></mml:mspace>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mtext mathvariant="bold">x</mml:mtext>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mo>𝓡</mml:mo>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
.</p>
</list-item>
</list>
</p>
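<p>The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function ark, its arguments and the toy data are hypothetical, the subset labels are assumed to come from a separate clustering step, and toy_estimate is a deliberately trivial stand-in for a real composition estimator such as Taxy, Quikr or SEK.</p>

```python
def ark(vectors, labels, num_subsets, estimate):
    """Fuse per-subset composition estimates as in Eq (5).

    vectors: feature vectors, one per read
    labels: 0-based subset index q of each vector (step 1, e.g. from K-means)
    estimate: assumed callable mapping a mean vector to a composition list
    """
    total = len(vectors)
    fused = None
    for q in range(num_subsets):
        subset = [v for v, lab in zip(vectors, labels) if lab == q]
        if not subset:
            continue  # an empty subset has weight P_q = 0
        weight = len(subset) / total                             # step 2: P_q, Eq (3)
        mu_q = [sum(col) / len(subset) for col in zip(*subset)]  # step 2: mean
        comp_q = estimate(mu_q)                                  # step 3
        if fused is None:
            fused = [0.0] * len(comp_q)
        fused = [f + weight * c for f, c in zip(fused, comp_q)]  # step 4: fusion
    return fused


def toy_estimate(mu):
    """Stand-in estimator: treats each axis as one taxon and normalizes mu."""
    total = sum(mu)
    return [m / total for m in mu]


# Toy data (assumed): four feature vectors already assigned to two subsets.
vectors = [[4.0, 0.0], [2.0, 2.0], [0.0, 4.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
composition = ark(vectors, labels, 2, toy_estimate)
```

<p>With these inputs both subsets have weight 0.5, the per-subset estimates are [0.75, 0.25] and [0.0, 1.0], and the fused composition is [0.375, 0.625].</p>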
<p>The ARK method is described using a flow-chart in
<xref ref-type="fig" rid="pone.0140644.g001">Fig 1</xref>
. The flow-chart shows the main components of the overall system and the associated off-line and on-line computations. The crucial computational/statistical challenges related to the ARK algorithm outlined above are as follows:
<list list-type="order">
<list-item>
<p>What is an appropriate number of subsets
<italic>Q</italic>
?</p>
</list-item>
<list-item>
<p>How should one form the subsets 𝓡
<sub>
<italic>q</italic>
</sub>
?</p>
</list-item>
</list>
</p>
<fig id="pone.0140644.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g001</object-id>
<label>Fig 1</label>
<caption>
<title>A flow-chart of the ARK method.</title>
</caption>
<graphic xlink:href="pone.0140644.g001"></graphic>
</fig>
<p>The above points are inherent to any subset-forming algorithm, and more generally to any clustering algorithm. Furthermore, finding optimal regions (or clusters) requires alternating optimization techniques. Given a pre-defined
<italic>Q</italic>
, typically a K-means algorithm performs two alternating optimization steps. These are: (1) given a set of representation vectors
<inline-formula id="pone.0140644.e016">
<alternatives>
<graphic xlink:href="pone.0140644.e016.jpg" id="pone.0140644.e016g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M16">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">μ</mml:mi>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
(also called code vectors) form new clusters
<inline-formula id="pone.0140644.e017">
<alternatives>
<graphic xlink:href="pone.0140644.e017.jpg" id="pone.0140644.e017g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M17">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mo>𝓡</mml:mo>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>Q</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
by a nearest neighbor rule (or form new subsets from the full dataset), (2) find the set of cluster representation vectors given the assignment of data into clusters. The optimal representation vector is the mean vector if squared Euclidean distance is used for the nearest neighbor rule. The K-means algorithm initializes with a set of representative vectors and runs alternating optimization until convergence in the sense that the average squared Euclidean distance is no longer reduced. In the present paper we perform the clustering using a popular vector quantization method called the Linde-Buzo-Gray (LBG) algorithm [
<xref rid="pone.0140644.ref015" ref-type="bibr">15</xref>
] from the source coding literature. Several variants of the LBG algorithm are available. In one variant, the algorithm starts with
<italic>Q</italic>
= 1 and then slowly splits the dense and high probability clusters to end up with a high
<italic>Q</italic>
, such that it does not deviate significantly from an exponentially decaying bit rate versus coding distortion (rate-distortion) curve.</p>
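<p>The two alternating steps and the split initialization can be sketched in plain Python as below. The function names nearest, lloyd and split, the multiplicative perturbation with eps = 0.01, the stopping rule based on unchanged assignments and the toy data are all assumptions made for illustration; the LBG variant used in the paper may differ in each of these details.</p>

```python
def nearest(vector, codebook):
    """Index of the code vector with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vector, code)) for code in codebook]
    return dists.index(min(dists))


def lloyd(vectors, codebook, max_iter=100):
    """Alternate (1) nearest-neighbor assignment and (2) mean update."""
    codebook = [list(code) for code in codebook]  # do not modify the input
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(v, codebook) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the means would not change
        labels = new_labels
        for q in range(len(codebook)):
            members = [v for v, lab in zip(vectors, labels) if lab == q]
            if members:  # keep the old code vector if a cluster is empty
                codebook[q] = [sum(col) / len(members) for col in zip(*members)]
    return codebook, labels


def split(code, eps=0.01):
    """Split initialization: perturb one code vector into two nearby ones."""
    return [c * (1 + eps) for c in code], [c * (1 - eps) for c in code]


# Toy data (assumed): two well-separated groups of 2-dimensional vectors.
vectors = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
mean = [sum(col) / len(vectors) for col in zip(*vectors)]  # codebook for Q = 1
codebook, labels = lloyd(vectors, list(split(mean)))       # refined for Q = 2
```

<p>On this toy input the perturbed copies of the overall mean move to the centers of the two groups, approximately [5.1, 4.9] and [1.1, 0.9], after a single update.</p>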
<p>In ARK, we use the following two strategies to solve the two challenges listed above.
<list list-type="order">
<list-item>
<p>Optimal/deterministic strategy: Start with
<italic>Q</italic>
= 1, which corresponds to the previous approach with a single mean vector as the data summary. Then set
<italic>Q</italic>
= 2 for the LBG algorithm, which uses the squared Euclidean distance as the distortion measure; the LBG algorithm minimizes the mean of the squared Euclidean distance (also called the mean square error). Initialization is done by a standard split approach in which the mean vector is perturbed. Using
<italic>Q</italic>
= 2,
<inline-formula id="pone.0140644.e018">
<alternatives>
<graphic xlink:href="pone.0140644.e018.jpg" id="pone.0140644.e018g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M18">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mo>𝓡</mml:mo>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
is formed and we estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
). Subsequently,
<italic>Q</italic>
is increased by one until a convergence criterion is met. For
<italic>Q</italic>
≥ 3, we always split the highest ranking cluster into two subclusters and use the LBG algorithm to find the optimal clusters. The number of clusters
<italic>Q</italic>
is no longer increased if the estimated values of
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) differ negligibly for
<italic>Q</italic>
and (
<italic>Q</italic>
− 1). In practice, the stopping condition we use is that the variational distance between
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
)∣
<sub>
<italic>Q</italic>
</sub>
and
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
)∣
<sub>(
<italic>Q</italic>
−1)</sub>
is less than a predetermined threshold. This condition can be written as
<inline-formula id="pone.0140644.e019">
<alternatives>
<graphic xlink:href="pone.0140644.e019.jpg" id="pone.0140644.e019g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M19">
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msubsup>
<mml:mtext mathvariant="normal">abs</mml:mtext>
<mml:mrow>
<mml:mo stretchy="true">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi>Q</mml:mi>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="true">)</mml:mo>
</mml:mrow>
<mml:mo>&lt;</mml:mo>
<mml:mi>η</mml:mi>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
, with a user defined choice of the threshold
<italic>η</italic>
. Note that
<italic>η</italic>
∈ (0,1] provides an allowable limit as a scaled variational distance (VD) between two probability mass functions; a typical choice of
<italic>η</italic>
can be 0.01. This strategy is typically found to provide consistent performance improvement in the sense of estimating
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) with the increase in
<italic>Q</italic>
in steps of one, but without an absolute guarantee, since the optimization strategy minimizes the mean square error. Furthermore, we allow the number of clusters to increase only up to a pre-defined maximum limit
<italic>Q</italic>
<sub>
<italic>max</italic>
</sub>
. The limit
<italic>Q</italic>
<sub>
<italic>max</italic>
</sub>
is preferably chosen as an integer power of two. A typical choice of
<italic>Q</italic>
<sub>
<italic>max</italic>
</sub>
is between 16 and 256.</p>
</list-item>
<list-item>
<p>Non-optimal/random strategy: For very large test sets, we use a pre-determined
<italic>Q</italic>
and a random choice of the
<italic>Q</italic>
representation vectors. Then the full test set is divided into
<italic>Q</italic>
subsets by a nearest neighbor rule and we compute the set of
<italic>Q</italic>
mean vectors {
<bold>
<italic>μ</italic>
</bold>
<sub>
<italic>q</italic>
</sub>
}, and cluster probabilities {
<italic>P</italic>
<sub>
<italic>q</italic>
</sub>
}. Even though this non-optimal strategy does not use an alternating optimization (such as the LBG algorithm) to form optimal clusters, it divides the full test set into subsets, resulting in a set of
<italic>Q</italic>
localized mean vectors across the full test set.</p>
</list-item>
</list>
</p>
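<p>The two strategies above can be illustrated with a minimal Python sketch (the computations reported in this paper were done in Matlab, so this is not the authors' code). The function names, the choice of drawing the random representation vectors from the reads themselves, and the weighting of per-cluster estimates by the cluster probabilities are assumptions made for illustration only.</p>

```python
import numpy as np

def stop_increasing_q(p_curr, p_prev, eta=0.01):
    """True when the summed absolute change in the estimated composition falls below eta."""
    change = np.abs(np.asarray(p_curr, dtype=float) - np.asarray(p_prev, dtype=float)).sum()
    return bool(eta > change)

def random_partition(kmer_freqs, num_clusters, seed=0):
    """Split per-read k-mer frequency vectors (rows) into subsets around randomly chosen reads."""
    rng = np.random.default_rng(seed)
    n_reads = kmer_freqs.shape[0]
    # Random choice of Q representation vectors, drawn from the reads (assumption).
    reps = kmer_freqs[rng.choice(n_reads, size=num_clusters, replace=False)]
    # Nearest neighbor rule under squared Euclidean distance.
    dists = ((kmer_freqs[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Mean vector and probability (fraction of reads) of each non-empty subset.
    used = np.unique(labels)
    means = np.stack([kmer_freqs[labels == q].mean(axis=0) for q in used])
    probs = np.array([(labels == q).mean() for q in used])
    return means, probs

def aggregate(per_cluster_estimates, probs):
    """Combine per-cluster composition estimates (rows), weighted by cluster probability (assumption)."""
    return probs @ np.asarray(per_cluster_estimates, dtype=float)
```

<p>In this sketch each row of <italic>means</italic> would be passed to the underlying estimator (Quikr or SEK), and the resulting per-cluster estimates would be combined with <italic>aggregate</italic>. Duplicate reads can leave a subset empty, so fewer than <italic>Q</italic> mean vectors may be returned.</p>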
<p>Finally, we note that the use of K-means is motivated by its simplicity and computational ease. A statistical counterpart of K-means, in the form of expectation-maximization based mixture modeling (for example, a Gaussian mixture model), could have been investigated, but it requires more computation to handle a large dataset of reads.</p>
</sec>
<sec id="sec009">
<title>Synthetic data generation for method evaluation</title>
<p>To evaluate the performance of the ARK method, we conducted experiments on simulated data as described below. For these, and all computations reported in the remainder of the paper, we used MATLAB version R2013b (with some instances of C code) on a desktop workstation with an Intel Core i7 4930K processor and 64 GB of RAM.</p>
<sec id="sec010">
<title>Test datasets (Reads)</title>
<p>We simulated 180 16S rRNA gene 454-like datasets using the RDP training set 7 and the Grinder read simulator [
<xref rid="pone.0140644.ref021" ref-type="bibr">21</xref>
] targeting the V1–V2 and V3–V5 variable regions with read lengths fixed at 250 bp or normally distributed with a mean of 450 bp and variance 50 bp. Read depths were chosen to be either 10K, 100K or 250K, while three different read distributions were used: power law, uniform, and linear. Diversity was set at either 50, 100, or 500 taxa and chimera percentages were set to 5% or 35%. The Balzer model [
<xref rid="pone.0140644.ref022" ref-type="bibr">22</xref>
] was chosen for homopolymer errors, and copy bias was included while length bias was excluded.</p>
</sec>
<sec id="sec011">
<title>Training dataset (Reference)</title>
<p>In our ARK experiments we used Quikr [
<xref rid="pone.0140644.ref003" ref-type="bibr">3</xref>
] and SEK [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
] to estimate
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
∣<bold>x</bold>
∈ 𝓡
<sub>
<italic>q</italic>
</sub>
). The RDP training set 7 was used as the base reference database for both Quikr and SEK. Note that this is the same as database
<italic>D</italic>
<sub>small</sub>
utilized in [
<xref rid="pone.0140644.ref003" ref-type="bibr">3</xref>
]. While in the main manuscript we use the same data for both training and testing the base methods (Quikr and SEK), in
<xref ref-type="supplementary-material" rid="pone.0140644.s001">S1 File</xref>
we include results obtained when the test datasets have taxa absent from the training database (that is, sister taxa have been excluded from the training database). As expected, all methods experience a loss in reconstruction accuracy when sister taxa are absent, but ARK Quikr and ARK SEK are still more accurate than RDP’s NBC.</p>
</sec>
</sec>
<sec id="sec012">
<title>Real biological data</title>
<p>To further evaluate ARK, we also utilized 28 Illumina MiSeq 16S rRNA gene human body-site associated samples, plus one negative control sample. The real data consist of a total of over 5.7 M reads distributed over three variable regions (V1–V2, V3–V4, and V3–V5) as well as two body sites (vagina and feces).</p>
<p>For each of these samples DNA was extracted using the FastDNA SPIN Kit for Soil with a FastPrep machine (MP Biomedicals) following the manufacturer’s protocol. 16S rRNA gene amplicons were generated from the DNA extractions using the primer combinations listed in Section 5 of
<xref ref-type="supplementary-material" rid="pone.0140644.s001">S1 File</xref>
. The Q5 High-fidelity polymerase kit (New England Biolabs) was used to amplify the 16S rRNA genes, and PCR conditions were as follows: 98°C for 2 minutes, followed by 20 cycles of 98°C for 30 seconds, 50°C for 30 seconds and 72°C for 1 minute 30 seconds, followed by a final extension step at 72°C for 5 minutes. Following PCR, the amplicons were then purified using the Wizard SV Gel and PCR Clean-Up kit (Promega, UK). Sequencing of 16S rRNA gene amplicons was carried out by Illumina Inc. (Little Chesterford, UK) using a MiSeq instrument run for 2 x 250 (V1–V2), 300 + 200 (V3–V4) and 400 + 200 (V3–V5) cycles. These data have been submitted to the European Nucleotide Archive using the accession number PRJEB9828.</p>
<p>After trimming 20 bp of primer off each read, the sequences were trimmed from the right until all bases had a quality score greater than 27. This reduced the total number of reads to approximately 4 M, and reduced the mean read length from 315 bp to 257 bp. We then used all resulting unpaired reads (both forward and reverse), including any duplicate sequences. We include in
<xref ref-type="supplementary-material" rid="pone.0140644.s001">S1 File</xref>
results for an alternative error-correction protocol, as well as results for assembling paired-end reads (Figs E and F in
<xref ref-type="supplementary-material" rid="pone.0140644.s001">S1 File</xref>
).</p>
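<p>The trimming step described above can be sketched as follows. This is one literal reading of the description, not the authors' pipeline: the function name and the representation of a read as a sequence string plus a list of integer quality scores are assumptions made for illustration.</p>

```python
def trim_read(seq, quals, primer_len=20, min_qual=27):
    """Drop the leading primer, then trim from the right until all remaining bases exceed min_qual."""
    seq, quals = seq[primer_len:], quals[primer_len:]
    keep = len(seq)
    # Remove bases from the right end while any remaining base is at or below the threshold.
    while keep > 0 and min_qual >= min(quals[:keep]):
        keep -= 1
    return seq[:keep], quals[:keep]
```

<p>Under this reading a read is cut just before its first base with a quality score of 27 or lower, which is consistent with the reported drop in mean read length; reads that become empty would be discarded.</p>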
</sec>
<sec id="sec013">
<title>Ethics Statement</title>
<p>For human body-site associated samples, the faecal samples used were not part of a clinical study so there is no corresponding ethical approval or written consent. There were no clinical records. The samples are anonymised and de-identified. Further, vaginal samples were collected as part of an observational microbicide feasibility study. The study was approved by the Ethics Committees of the National Institute for Medical Research in Tanzania and London School of Hygiene and Tropical Medicine, and all participants gave written informed consent. All records were anonymized and de-identified prior to this retrospective analysis.</p>
</sec>
</sec>
<sec sec-type="results" id="sec014">
<title>Results</title>
<sec id="sec015">
<title>Performance measure and relevant methods</title>
<p>As a quantitative performance measure, we use the variational distance (VD) to compare the known proportions of taxonomic units
<bold>p</bold>
= [
<italic>p</italic>
(𝒞
<sub>1</sub>
),
<italic>p</italic>
(𝒞
<sub>2</sub>
), …,
<italic>p</italic>
(𝒞
<sub>
<italic>M</italic>
</sub>
)]
<sup>
<italic>t</italic>
</sup>
and the estimated proportions
<inline-formula id="pone.0140644.e020">
<alternatives>
<graphic xlink:href="pone.0140644.e020.jpg" id="pone.0140644.e020g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M20">
<mml:mrow>
<mml:mover accent="true">
<mml:mtext mathvariant="bold">p</mml:mtext>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mspace width="0.167em"></mml:mspace>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mo>𝒞</mml:mo>
<mml:mi>M</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
. The VD is defined as
<disp-formula id="pone.0140644.e021">
<alternatives>
<graphic xlink:href="pone.0140644.e021.jpg" id="pone.0140644.e021g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M21">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mrow>
<mml:mtext>VD</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:mo>∥</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mo>-</mml:mo>
</mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:msub>
<mml:mrow>
<mml:mo>∥</mml:mo>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</alternatives>
</disp-formula>
A lower VD indicates better performance.</p>
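<p>As a small illustration of the VD definition above, the following Python sketch computes half the L1 distance between two proportion vectors. The function name is an assumption made for illustration.</p>

```python
import numpy as np

def variational_distance(p_true, p_est):
    """VD = 0.5 * ||p - p_hat||_1, which lies in [0, 1] for probability vectors."""
    p_true = np.asarray(p_true, dtype=float)
    p_est = np.asarray(p_est, dtype=float)
    return 0.5 * np.abs(p_true - p_est).sum()
```

<p>For example, a true composition of [0.5, 0.5, 0] and an estimate of [0.25, 0.25, 0.5] give a VD of 0.5, while two compositions with disjoint support give the maximum value of 1.</p>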
<p>For ARK, we used both SEK and Quikr as the underlying estimation methods applied to each cluster. These recent methods were chosen as appropriate representatives of fast and accurate sparse signal processing approaches. A
<italic>k</italic>
-mer size of
<italic>k</italic>
= 6 was used for both Quikr and SEK.</p>
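<p>The feature on which both base methods operate is a vector of <italic>k</italic>-mer frequencies. A minimal Python sketch of computing such a vector for one read is given below; the function name, the lexicographic ACGT ordering, and the skipping of <italic>k</italic>-mers with ambiguous bases are assumptions made for illustration and may differ from the Quikr and SEK implementations.</p>

```python
from itertools import product

def kmer_frequencies(read, k=6):
    """Return normalized k-mer frequencies of a read in lexicographic ACGT order."""
    index = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(read) - k + 1):
        pos = index.get(read[start:start + k].upper())
        if pos is not None:  # skip k-mers containing ambiguous bases such as N
            counts[pos] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

<p>With <italic>k</italic> = 6 the vector has 4<sup>6</sup> = 4096 entries per read.</p>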
<p>As part of the SEK pipeline, sequences in a given database are split into subsequences. From the 10,046 sequences in the RDP training set 7, we selected all sequences longer than 700 bp and then split them into subsequences of length 400 bp with 100 bp of overlap. This corresponds to setting
<italic>L</italic>
<sub>
<italic>w</italic>
</sub>
= 400 and
<italic>L</italic>
<sub>
<italic>p</italic>
</sub>
= 100 as specified in [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
]. We used the SEK algorithm
<inline-formula id="pone.0140644.e022">
<alternatives>
<graphic xlink:href="pone.0140644.e022.jpg" id="pone.0140644.e022g" mimetype="image" position="anchor" orientation="portrait"></graphic>
<mml:math id="M22">
<mml:mrow>
<mml:msubsup>
<mml:mtext mathvariant="normal">OMP</mml:mtext>
<mml:mtext mathvariant="normal">sek</mml:mtext>
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</alternatives>
</inline-formula>
with parameters as in [
<xref rid="pone.0140644.ref014" ref-type="bibr">14</xref>
].</p>
</sec>
<sec id="sec016">
<title>Results for Simulated Data</title>
<sec id="sec017">
<title>Effect of increasing number of clusters</title>
<p>We first investigate how an increase in the number of clusters
<italic>Q</italic>
affects the composition reconstruction fidelity and algorithm execution time for the simulated data. Only the non-optimal/random strategy of K-means clustering was used, as we found that the performance improvement for the optimal/deterministic strategy was insignificant given the resulting increase in execution time (results not shown). Averaging the VD error at the genus level over all 180 simulated experiments, we found that combining ARK with either SEK or Quikr resulted in a power-law-like decay of the VD error as a function of the number of clusters (
<xref ref-type="fig" rid="pone.0140644.g002">Fig 2</xref>
). ARK substantially increases reconstruction fidelity: using ARK SEK or ARK Quikr with one cluster is equivalent to running SEK or Quikr with no modification, so the one-cluster results serve as the baseline for comparison.</p>
<fig id="pone.0140644.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g002</object-id>
<label>Fig 2</label>
<caption>
<title>Results for the random K-means clustering on the simulated data.</title>
<p>Mean VD error at the genus level as a function of the number of clusters. Note the improvement that ARK contributes to each method.</p>
</caption>
<graphic xlink:href="pone.0140644.g002"></graphic>
</fig>
<p>Since the underlying algorithm (SEK or Quikr) must be executed on each cluster formed by the K-means clustering, we expect the total algorithm execution time to increase by a factor equal to the number of chosen clusters. As seen in
<xref ref-type="fig" rid="pone.0140644.g003">Fig 3</xref>
, both algorithms experience an increase in execution time roughly proportional to the number of clusters.</p>
<fig id="pone.0140644.g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g003</object-id>
<label>Fig 3</label>
<caption>
<title>Results for the random K-means clustering on the simulated data.</title>
<p>Mean execution time increase (factor given in comparison to running SEK or Quikr in the absence of ARK) as a function of number of clusters. The dashed line represents a line with slope 1.</p>
</caption>
<graphic xlink:href="pone.0140644.g003"></graphic>
</fig>
</sec>
<sec id="sec018">
<title>Fixed number of clusters</title>
<p>Given the decrease in VD as a function of the number of clusters seen above, we also fixed the number of clusters
<italic>Q</italic>
to 75 to compare the performance of the underlying algorithms with and without ARK. There was a significant decrease in the VD error (as seen in
<xref ref-type="fig" rid="pone.0140644.g004">Fig 4</xref>
) at the cost of an increase in execution time (as seen in
<xref ref-type="fig" rid="pone.0140644.g005">Fig 5</xref>
). However, given the speed of both Quikr and SEK, we expect that the addition of ARK will not result in prohibitively long execution times. Indeed, as shown in the Real Biological Data section below, both ARK Quikr and ARK SEK are still several hours faster on real biological data than the Ribosomal Database Project’s Naïve Bayesian Classifier (RDP’s NBC) [
<xref rid="pone.0140644.ref001" ref-type="bibr">1</xref>
], even when using 75 clusters.</p>
<fig id="pone.0140644.g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g004</object-id>
<label>Fig 4</label>
<caption>
<title>Comparison of the underlying algorithms with and without ARK.</title>
<p>Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Mean VD error at the genus level. Included for comparison are results for RDP’s NBC (compare to Fig 2(b) of [
<xref rid="pone.0140644.ref003" ref-type="bibr">3</xref>
]).</p>
</caption>
<graphic xlink:href="pone.0140644.g004"></graphic>
</fig>
<fig id="pone.0140644.g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g005</object-id>
<label>Fig 5</label>
<caption>
<title>Comparison of the underlying algorithms with and without ARK.</title>
<p>Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Boxplot of the individual simulated sample execution times. Mean execution times for Quikr and ARK Quikr were 1.75 seconds and 4.71 minutes, while for SEK and ARK SEK they were 21.26 seconds and 19.21 minutes respectively. Mean execution time for RDP’s NBC was 38.19 minutes.</p>
</caption>
<graphic xlink:href="pone.0140644.g005"></graphic>
</fig>
</sec>
</sec>
<sec id="sec019">
<title>Real Biological Data</title>
<p>We used ARK combined with SEK and Quikr to analyze the real biological data and compared these results to those obtained from the RDP’s NBC. All methods used RDP’s training set 7 as the underlying training database. The random K-means clustering was used for the ARK method, and the number of clusters
<italic>Q</italic>
was set to 75.
<xref ref-type="fig" rid="pone.0140644.g006">Fig 6</xref>
shows the total execution time of each method. While ARK does increase the execution time of Quikr and SEK, the total execution time is still significantly less than that of RDP’s NBC. Note that none of the datasets here are de-duplicated. The execution time of RDP’s NBC can be reduced by de-duplicating the data before classifying. However, this requires additional computational time to find duplicate sequences, and since we are directly comparing classification methods here (not computational shortcuts), we use the same non-de-duplicated data for all methods.</p>
<fig id="pone.0140644.g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g006</object-id>
<label>Fig 6</label>
<caption>
<title>Total execution time for each method on the 28 samples of real biological data.</title>
</caption>
<graphic xlink:href="pone.0140644.g006"></graphic>
</fig>
<p>To compare the results of each method, we produced PCoA (also known as classical multidimensional scaling) plots by applying the Jensen-Shannon divergence to each of the reconstructions. The points represent individual samples, and the color and shape denote the associated metadata. Each of the methods produced similar PCoA plots.
<xref ref-type="fig" rid="pone.0140644.g007">Fig 7</xref>
shows the result for RDP’s NBC and
<xref ref-type="fig" rid="pone.0140644.g008">Fig 8</xref>
shows the result for ARK SEK, with samples labeled by body site. Note the similar clusterings.</p>
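<p>The comparison described above can be sketched in Python as follows. The function names are assumptions made for illustration, and the text does not state whether the divergence or its square root was used as the dissimilarity, so the choice of input matrix is left to the caller.</p>

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with zero probability contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pcoa(dist, n_axes=2):
    """Classical multidimensional scaling of a symmetric dissimilarity matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Negative eigenvalues (non-Euclidean dissimilarities) are clipped to zero.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
```

<p>Each row of the returned array gives the coordinates of one sample, which can then be plotted and colored by body site or variable region.</p>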
<fig id="pone.0140644.g007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g007</object-id>
<label>Fig 7</label>
<caption>
<title>PCoA plots using the Jensen-Shannon divergence for RDP’s NBC.</title>
</caption>
<graphic xlink:href="pone.0140644.g007"></graphic>
</fig>
<fig id="pone.0140644.g008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g008</object-id>
<label>Fig 8</label>
<caption>
<title>PCoA plots using the Jensen-Shannon divergence for ARK SEK.</title>
</caption>
<graphic xlink:href="pone.0140644.g008"></graphic>
</fig>
<p>As shown in Figs
<xref ref-type="fig" rid="pone.0140644.g009">9</xref>
and
<xref ref-type="fig" rid="pone.0140644.g010">10</xref>
, while ARK Quikr gave a somewhat similar PCoA plot with regard to body site (
<xref ref-type="fig" rid="pone.0140644.g009">Fig 9</xref>
), clustering by variable region (
<xref ref-type="fig" rid="pone.0140644.g010">Fig 10</xref>
) was also observed. This is most likely due to the fact that different variable regions have different
<italic>k</italic>
-mer distributions and different taxa will be preferentially amplified by the varying PCR primers [
<xref rid="pone.0140644.ref023" ref-type="bibr">23</xref>
]. ARK Quikr can detect this as it analyzes each sample in its entirety, as opposed to the read-by-read nature of RDP’s NBC. This is corroborated by the fact that when using the Jensen-Shannon divergence directly on the 6-mer counts, similar grouping by variable region was observed (results not shown).</p>
<fig id="pone.0140644.g009" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g009</object-id>
<label>Fig 9</label>
<caption>
<title>ARK Quikr PCoA plots (using the Jensen-Shannon divergence) on the real biological data.</title>
<p>In this case, we have labeling by body site. Note the clustering.</p>
</caption>
<graphic xlink:href="pone.0140644.g009"></graphic>
</fig>
<fig id="pone.0140644.g010" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0140644.g010</object-id>
<label>Fig 10</label>
<caption>
<title>ARK Quikr PCoA plots (using the Jensen-Shannon divergence) on the real biological data.</title>
<p>In this case, we have labeling by variable region. Note the clustering.</p>
</caption>
<graphic xlink:href="pone.0140644.g010"></graphic>
</fig>
</sec>
</sec>
<sec id="sec020">
<title>Discussion and Conclusion</title>
<p>The addition of a data processing step based on clustering the read information prior to community composition estimation is akin to the generic divide-and-conquer principle used judiciously in the machine learning field. In terms of information content of the read data, the individual means of the
<italic>k</italic>
-mer frequencies can collectively provide a better summary than the single mean vector used in the previous approaches, when sufficient heterogeneity is present among the sequences. Our experiments demonstrate this effect by a substantial increase in the accuracy of the resulting estimates. Moreover, the clustering employed by ARK is found to be robust in the sense that it does not lead to lower accuracies, even if a suboptimal number of clusters and clustering strategy were used. We found that the improvement in reconstruction accuracy was obtained at the cost of a moderate increase in execution time for the studied methods.</p>
<p>We note that under the clustering algorithm employed by ARK, no quantitative claims can be made concerning the global optimality of the resulting clusters or on consistent improvement in performance. Also, there is no absolute guarantee that the estimation of
<italic>p</italic>
(𝒞
<sub>
<italic>m</italic>
</sub>
) is bound to improve monotonically with an increase in
<italic>Q</italic>
. Thus, in an individual experiment, it is possible to encounter occasional degradation in performance. However, our results suggest that a larger number of clusters
<italic>Q</italic>
will tend to perform reasonably better than a much smaller value of
<italic>Q</italic>
, provided that the resulting clusters are not so small that they yield very noisy estimates of the mean vector.</p>
<p>While this study has focused on 16S rRNA gene sequencing based data, there is no theoretical limitation in applying this technique also to whole-genome shotgun (WGS) metagenomics. Indeed, ARK can readily be combined with existing WGS
<italic>k</italic>
-mer feature vector metagenomics reconstruction techniques (such as WGSQuikr [
<xref rid="pone.0140644.ref024" ref-type="bibr">24</xref>
]). Thus, we aim to investigate the versatility of this approach as a complement to other WGS metagenomics analysis methods in the future.</p>
</sec>
<sec sec-type="supplementary-material" id="sec021">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0140644.s001">
<label>S1 File</label>
<caption>
<title>Supplementary Information for “ARK: Aggregation of Reads by K-means for Estimation of Bacterial Community Composition”.</title>
<p>This supporting information is available online. This supplementary material is included to address eight major points:
<list list-type="order">
<list-item>
<p>To compare ARK with the best performing bacterial community composition method to date, called BEBaC [
<xref rid="pone.0140644.ref008" ref-type="bibr">8</xref>
]. BEBaC employs a Bayesian estimation clustering framework along with a stochastic search and sequence alignment.</p>
</list-item>
<list-item>
<p>To investigate the important question of finding the number of regions
<italic>Q</italic>
in ARK.</p>
</list-item>
<list-item>
<p>To independently verify ARK in two different geographic regions ((1) Sweden and Finland, and (2) USA) and also using different datasets.</p>
</list-item>
<list-item>
<p>To detail genera-level reconstructions of ARK SEK, ARK Quikr, and RDP’s NBC.</p>
</list-item>
<list-item>
<p>To detail the primers used to obtain the data in the main text.</p>
</list-item>
<list-item>
<p>To demonstrate the results are qualitatively independent of the error correction method chosen.</p>
</list-item>
<list-item>
<p>To detail the effect of changing the
<italic>k</italic>
-mer size.</p>
</list-item>
<list-item>
<p>To investigate the behavior of each method when sister taxa are excluded from the training database.</p>
</list-item>
</list>
</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0140644.s001.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>The authors wish to thank Paul Scott (Wellcome Trust Sanger Institute) for generating the 16S rRNA gene amplicons used in the real biological analyses. Real biological data from vaginal sites was collected in Mwanza, Tanzania from the Mwanza Intervention Trials Unit, National Institute of Medical Research.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0140644.ref001">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
,
<name>
<surname>Garrity</surname>
<given-names>GM</given-names>
</name>
,
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
.
<article-title>Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy</article-title>
.
<source>Appl Environ Microbiol</source>
.
<year>2007</year>
;
<volume>73</volume>
(
<issue>16</issue>
):
<fpage>5261</fpage>
<lpage>5267</lpage>
.
<pub-id pub-id-type="doi">10.1128/AEM.00062-07</pub-id>
<pub-id pub-id-type="pmid">17586664</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref002">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>Meinicke</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Aßhauer</surname>
<given-names>KP</given-names>
</name>
,
<name>
<surname>Lingner</surname>
<given-names>T</given-names>
</name>
.
<article-title>Mixture models for analysis of the taxonomic composition of metagenomes</article-title>
.
<source>Bioinformatics</source>
.
<year>2011</year>
;
<volume>27</volume>
(
<issue>12</issue>
):
<fpage>1618</fpage>
<lpage>1624</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr266</pub-id>
<pub-id pub-id-type="pmid">21546400</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref003">
<label>3</label>
<mixed-citation publication-type="journal">
<name>
<surname>Koslicki</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Foucart</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
.
<article-title>Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing</article-title>
.
<source>Bioinformatics</source>
.
<year>2013</year>
;
<volume>29</volume>
(
<issue>17</issue>
):
<fpage>2096</fpage>
<lpage>2102</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btt336</pub-id>
<pub-id pub-id-type="pmid">23786768</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref004">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ong</surname>
<given-names>SH</given-names>
</name>
,
<name>
<surname>Kukkillaya</surname>
<given-names>VU</given-names>
</name>
,
<name>
<surname>Wilm</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Lay</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Ho</surname>
<given-names>EXP</given-names>
</name>
,
<name>
<surname>Low</surname>
<given-names>L</given-names>
</name>
,
<etal>et al</etal>
<article-title>Species Identification and Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences</article-title>
.
<source>PLoS One</source>
.
<year>2013</year>
;
<volume>8</volume>
(
<issue>4</issue>
):
<fpage>e60811</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0060811</pub-id>
<pub-id pub-id-type="pmid">23579286</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref005">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Dröge</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Gregor</surname>
<given-names>I</given-names>
</name>
,
<name>
<surname>McHardy</surname>
<given-names>A</given-names>
</name>
.
<article-title>Taxator-tk: Precise Taxonomic Assignment of Metagenomes by Fast Approximation of Evolutionary Neighborhoods</article-title>
.
<source>Bioinformatics</source>
.
<year>2014</year>
;
<volume>31</volume>
(
<issue>6</issue>
):
<fpage>817</fpage>
<lpage>824</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu745</pub-id>
<pub-id pub-id-type="pmid">25388150</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref006">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Cai</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Sun</surname>
<given-names>Y</given-names>
</name>
.
<article-title>ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>2011</year>
;
<volume>39</volume>
(
<issue>14</issue>
):
<fpage>e95</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkr349</pub-id>
<pub-id pub-id-type="pmid">21596775</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref007">
<label>7</label>
<mixed-citation publication-type="journal">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
.
<article-title>Search and clustering orders of magnitude faster than BLAST</article-title>
.
<source>Bioinformatics</source>
.
<year>2010</year>
;
<volume>26</volume>
(
<issue>19</issue>
):
<fpage>2460</fpage>
<lpage>2461</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq461</pub-id>
<pub-id pub-id-type="pmid">20709691</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref008">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Cheng</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Walker</surname>
<given-names>AW</given-names>
</name>
,
<name>
<surname>Corander</surname>
<given-names>J</given-names>
</name>
.
<article-title>Bayesian estimation of bacterial community composition from 454 sequencing data</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>2012</year>
;
<volume>40</volume>
(
<issue>12</issue>
):
<fpage>5240</fpage>
<lpage>5249</lpage>
.
<pub-id pub-id-type="doi">10.1093/nar/gks227</pub-id>
<pub-id pub-id-type="pmid">22406836</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref009">
<label>9</label>
<mixed-citation publication-type="journal">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
,
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
,
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
.
<article-title>MEGAN analysis of metagenomic data</article-title>
.
<source>Genome Research</source>
.
<year>2007</year>
;
<volume>17</volume>
(
<issue>3</issue>
):
<fpage>377</fpage>
<lpage>386</lpage>
.
<pub-id pub-id-type="doi">10.1101/gr.5969107</pub-id>
<pub-id pub-id-type="pmid">17255551</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref010">
<label>10</label>
<mixed-citation publication-type="journal">
<name>
<surname>Mitra</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Stärk</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
.
<article-title>Analysis of 16S rRNA environmental sequences using MEGAN</article-title>
.
<source>BMC Genomics</source>
.
<year>2011</year>
;
<volume>12</volume>
(
<issue>Suppl 3</issue>
):
<fpage>S17</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2164-12-S3-S17</pub-id>
<pub-id pub-id-type="pmid">22369513</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref011">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>von Mering</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Raes</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Tringe</surname>
<given-names>SG</given-names>
</name>
,
<name>
<surname>Doerks</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
,
<etal>et al</etal>
.
<article-title>Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments</article-title>
.
<source>Science</source>
.
<year>2007</year>
;
<volume>315</volume>
(
<issue>5815</issue>
):
<fpage>1126</fpage>
<lpage>1130</lpage>
.
<pub-id pub-id-type="doi">10.1126/science.1133420</pub-id>
<pub-id pub-id-type="pmid">17272687</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref012">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Garbarine</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Caseiro</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Polikar</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Sokhansanj</surname>
<given-names>B</given-names>
</name>
.
<article-title>Metagenome Fragment Classification Using k-Mer Frequency Profiles</article-title>
.
<source>Advances in Bioinformatics</source>
.
<year>2008</year>
;
<volume>2008</volume>
<pub-id pub-id-type="doi">10.1155/2008/205969</pub-id>
<pub-id pub-id-type="pmid">19956701</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref013">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Reichenberger</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Rosenfeld</surname>
<given-names>A</given-names>
</name>
.
<article-title>NBC: the Naïve Bayes Classification tool webserver for taxonomic classification of metagenomic reads</article-title>
.
<source>Bioinformatics</source>
.
<year>2011</year>
;
<volume>27</volume>
(
<issue>1</issue>
):
<fpage>127</fpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq619</pub-id>
<pub-id pub-id-type="pmid">21062764</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref014">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chatterjee</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Koslicki</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Dong</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Innocenti</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Cheng</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Lan</surname>
<given-names>Y</given-names>
</name>
,
<etal>et al</etal>
.
<article-title>SEK: Sparsity exploiting
<italic>k</italic>
-mer-based estimation of bacterial community composition</article-title>
.
<source>Bioinformatics</source>
.
<year>2014</year>
;
<volume>30</volume>
(
<issue>17</issue>
):
<fpage>2423</fpage>
<lpage>2431</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btu320</pub-id>
<pub-id pub-id-type="pmid">24812337</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref015">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Linde</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Buzo</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Gray</surname>
<given-names>RM</given-names>
</name>
.
<article-title>An Algorithm for Vector Quantizer Design</article-title>
.
<source>IEEE Transactions on Communications</source>
.
<year>1980</year>
;
<volume>28</volume>
(
<issue>1</issue>
):
<fpage>84</fpage>
<lpage>95</lpage>
.
<pub-id pub-id-type="doi">10.1109/TCOM.1980.1094577</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref016">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chatterjee</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Sreenivas</surname>
<given-names>TV</given-names>
</name>
.
<article-title>Conditional PDF-Based Split Vector Quantization of Wideband LSF Parameters</article-title>
.
<source>Signal Processing Letters, IEEE</source>
.
<year>2007</year>
<month>9</month>
;
<volume>14</volume>
(
<issue>9</issue>
):
<fpage>641</fpage>
<lpage>644</lpage>
.
<pub-id pub-id-type="doi">10.1109/LSP.2007.894960</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref017">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chatterjee</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Sreenivas</surname>
<given-names>TV</given-names>
</name>
.
<article-title>Optimum switched split vector quantization of LSF parameters</article-title>
.
<source>Signal Processing</source>
.
<year>2008</year>
;
<volume>88</volume>
(
<issue>6</issue>
):
<fpage>1528</fpage>
<lpage>1538</lpage>
.
<pub-id pub-id-type="doi">10.1016/j.sigpro.2008.01.001</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref018">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ambat</surname>
<given-names>SK</given-names>
</name>
,
<name>
<surname>Chatterjee</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Hari</surname>
<given-names>KVS</given-names>
</name>
.
<article-title>Fusion of Algorithms for Compressed Sensing</article-title>
.
<source>IEEE Transactions on Signal Processing</source>
.
<year>2013</year>
;
<volume>61</volume>
(
<issue>14</issue>
):
<fpage>3699</fpage>
<lpage>3704</lpage>
.
<pub-id pub-id-type="doi">10.1109/TSP.2013.2259821</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref019">
<label>19</label>
<mixed-citation publication-type="journal">
<name>
<surname>Otu</surname>
<given-names>HH</given-names>
</name>
,
<name>
<surname>Sayood</surname>
<given-names>K</given-names>
</name>
.
<article-title>A divide-and-conquer approach to fragment assembly</article-title>
.
<source>Bioinformatics</source>
.
<year>2003</year>
;
<volume>19</volume>
(
<issue>1</issue>
):
<fpage>22</fpage>
<lpage>29</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/19.1.22</pub-id>
<pub-id pub-id-type="pmid">12499289</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref020">
<label>20</label>
<mixed-citation publication-type="book">
<name>
<surname>Duda</surname>
<given-names>RO</given-names>
</name>
,
<name>
<surname>Hart</surname>
<given-names>PE</given-names>
</name>
,
<name>
<surname>Stork</surname>
<given-names>DG</given-names>
</name>
.
<source>Pattern Classification</source>
.
<publisher-name>Wiley</publisher-name>
;
<year>2010</year>
.</mixed-citation>
</ref>
<ref id="pone.0140644.ref021">
<label>21</label>
<mixed-citation publication-type="journal">
<name>
<surname>Angly</surname>
<given-names>FE</given-names>
</name>
,
<name>
<surname>Willner</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Rohwer</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Tyson</surname>
<given-names>GW</given-names>
</name>
.
<article-title>Grinder: a versatile amplicon and shotgun sequence simulator</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>2012</year>
;
<volume>40</volume>
(
<issue>12</issue>
):
<fpage>e94</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gks251</pub-id>
<pub-id pub-id-type="pmid">22434876</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref022">
<label>22</label>
<mixed-citation publication-type="journal">
<name>
<surname>Balzer</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Malde</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Lanzén</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Sharma</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Jonassen</surname>
<given-names>I</given-names>
</name>
.
<article-title>Characteristics of 454 pyrosequencing data–enabling realistic simulation with flowsim</article-title>
.
<source>Bioinformatics</source>
.
<year>2010</year>
;
<volume>26</volume>
(
<issue>18</issue>
):
<fpage>i420</fpage>
<lpage>i425</lpage>
.
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq365</pub-id>
<pub-id pub-id-type="pmid">20823302</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref023">
<label>23</label>
<mixed-citation publication-type="journal">
<name>
<surname>Claesson</surname>
<given-names>MJ</given-names>
</name>
,
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
,
<name>
<surname>O’Sullivan</surname>
<given-names>O</given-names>
</name>
,
<name>
<surname>Greene-Diniz</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
,
<name>
<surname>Ross</surname>
<given-names>RP</given-names>
</name>
,
<name>
<surname>O’Toole</surname>
<given-names>PW</given-names>
</name>
.
<article-title>Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions</article-title>
.
<source>Nucleic Acids Research</source>
.
<year>2010</year>
;
<volume>38</volume>
(
<issue>22</issue>
):
<fpage>e200</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq873</pub-id>
<pub-id pub-id-type="pmid">20880993</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0140644.ref024">
<label>24</label>
<mixed-citation publication-type="journal">
<name>
<surname>Koslicki</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Foucart</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Rosen</surname>
<given-names>G</given-names>
</name>
.
<article-title>WGSQuikr: Fast Whole-Genome Shotgun Metagenomic Classification</article-title>
.
<source>PLoS ONE</source>
.
<year>2014</year>
;
<volume>9</volume>
(
<issue>3</issue>
):
<fpage>e91784</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0091784</pub-id>
<pub-id pub-id-type="pmid">24626336</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0010149 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0010149 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021