MERS exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

Internal identifier: 001014 (Pmc/Curation); previous: 001013; next: 001015

Authors: David Koslicki [United States]; Saikat Chatterjee [Sweden]; Damon Shahrivar [Sweden]; Alan W. Walker [United Kingdom]; Suzanna C. Francis [United Kingdom]; Louise J. Fraser [United Kingdom]; Mikko Vehkaperä [United Kingdom]; Yueheng Lan [People's Republic of China]; Jukka Corander [Finland]

Source: PLoS ONE (2015)

RBID : PMC:4619776

Abstract

Motivation

Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

Results

There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step: a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.
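
The aggregation step described above can be illustrated with a minimal Python sketch. It is not the authors' Julia or Matlab implementation: the function names (`kmer_frequencies`, `kmeans`, `aggregate_reads`), the choice of k = 2, and the plain Lloyd-style K-means loop are all assumptions made for illustration. The sketch stops at the cluster centroids and cluster weights, because the abstract does not specify how the clustering output is processed into the final estimate.

```python
from itertools import product

import numpy as np


def kmer_frequencies(read, k=2):
    """Return the normalized k-mer frequency vector of one read (length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd-style K-means (illustrative, not tuned for large inputs)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen distinct rows of X.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every read to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return centroids, labels


def aggregate_reads(reads, k=2, n_clusters=2):
    """Cluster reads by k-mer frequencies; return centroids and cluster weights."""
    X = np.array([kmer_frequencies(r, k) for r in reads])
    centroids, labels = kmeans(X, n_clusters)
    # Weight of a cluster = fraction of the sample's reads assigned to it.
    weights = np.bincount(labels, minlength=n_clusters) / len(reads)
    return centroids, weights


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCGG", "GGGCCCGGGC"]
    centroids, weights = aggregate_reads(toy_reads, k=2, n_clusters=2)
    print(centroids.shape, weights)
```

Each centroid is the mean k-mer frequency vector of one subset of reads, so a sample is summarized by several such vectors plus their weights rather than by a single vector. In this sketch the weighted sum of the centroids equals the overall mean k-mer frequency vector of the sample.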

Availability

An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619776
DOI: 10.1371/journal.pone.0140644
PubMed: 26496191
PubMed Central: 4619776

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</title>
<author>
<name sortKey="Koslicki, David" sort="Koslicki, David" uniqKey="Koslicki D" first="David" last="Koslicki">David Koslicki</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Dept of Mathematics, Oregon State University, Corvallis</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chatterjee, Saikat" sort="Chatterjee, Saikat" uniqKey="Chatterjee S" first="Saikat" last="Chatterjee">Saikat Chatterjee</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
<country xml:lang="fr">Suède</country>
<wicri:regionArea>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shahrivar, Damon" sort="Shahrivar, Damon" uniqKey="Shahrivar D" first="Damon" last="Shahrivar">Damon Shahrivar</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
<country xml:lang="fr">Suède</country>
<wicri:regionArea>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Walker, Alan W" sort="Walker, Alan W" uniqKey="Walker A" first="Alan W." last="Walker">Alan W. Walker</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Francis, Suzanna C" sort="Francis, Suzanna C" uniqKey="Francis S" first="Suzanna C." last="Francis">Suzanna C. Francis</name>
<affiliation wicri:level="1">
<nlm:aff id="aff004">
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Fraser, Louise J" sort="Fraser, Louise J" uniqKey="Fraser L" first="Louise J." last="Fraser">Louise J. Fraser</name>
<affiliation wicri:level="1">
<nlm:aff id="aff005">
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Illumina Cambridge Ltd., Chesterford Research Park, Essex</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Vehkaper, Mikko" sort="Vehkaper, Mikko" uniqKey="Vehkaper M" first="Mikko" last="Vehkaper">Mikko Vehkaperä</name>
<affiliation wicri:level="1">
<nlm:aff id="aff006">
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Lan, Yueheng" sort="Lan, Yueheng" uniqKey="Lan Y" first="Yueheng" last="Lan">Yueheng Lan</name>
<affiliation wicri:level="1">
<nlm:aff id="aff007">
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Dept of Physics, Tsinghua University, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation wicri:level="1">
<nlm:aff id="aff008">
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</nlm:aff>
<country xml:lang="fr">Finlande</country>
<wicri:regionArea>Dept of Mathematics and Statistics, University of Helsinki, Helsinki</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26496191</idno>
<idno type="pmc">4619776</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619776</idno>
<idno type="RBID">PMC:4619776</idno>
<idno type="doi">10.1371/journal.pone.0140644</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">001014</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001014</idno>
<idno type="wicri:Area/Pmc/Curation">001014</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">001014</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</title>
<author>
<name sortKey="Koslicki, David" sort="Koslicki, David" uniqKey="Koslicki D" first="David" last="Koslicki">David Koslicki</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Dept of Mathematics, Oregon State University, Corvallis</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chatterjee, Saikat" sort="Chatterjee, Saikat" uniqKey="Chatterjee S" first="Saikat" last="Chatterjee">Saikat Chatterjee</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
<country xml:lang="fr">Suède</country>
<wicri:regionArea>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shahrivar, Damon" sort="Shahrivar, Damon" uniqKey="Shahrivar D" first="Damon" last="Shahrivar">Damon Shahrivar</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</nlm:aff>
<country xml:lang="fr">Suède</country>
<wicri:regionArea>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Walker, Alan W" sort="Walker, Alan W" uniqKey="Walker A" first="Alan W." last="Walker">Alan W. Walker</name>
<affiliation wicri:level="1">
<nlm:aff id="aff003">
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Francis, Suzanna C" sort="Francis, Suzanna C" uniqKey="Francis S" first="Suzanna C." last="Francis">Suzanna C. Francis</name>
<affiliation wicri:level="1">
<nlm:aff id="aff004">
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Fraser, Louise J" sort="Fraser, Louise J" uniqKey="Fraser L" first="Louise J." last="Fraser">Louise J. Fraser</name>
<affiliation wicri:level="1">
<nlm:aff id="aff005">
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Illumina Cambridge Ltd., Chesterford Research Park, Essex</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Vehkaper, Mikko" sort="Vehkaper, Mikko" uniqKey="Vehkaper M" first="Mikko" last="Vehkaper">Mikko Vehkaperä</name>
<affiliation wicri:level="1">
<nlm:aff id="aff006">
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Lan, Yueheng" sort="Lan, Yueheng" uniqKey="Lan Y" first="Yueheng" last="Lan">Yueheng Lan</name>
<affiliation wicri:level="1">
<nlm:aff id="aff007">
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Dept of Physics, Tsinghua University, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Corander, Jukka" sort="Corander, Jukka" uniqKey="Corander J" first="Jukka" last="Corander">Jukka Corander</name>
<affiliation wicri:level="1">
<nlm:aff id="aff008">
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</nlm:aff>
<country xml:lang="fr">Finlande</country>
<wicri:regionArea>Dept of Mathematics and Statistics, University of Helsinki, Helsinki</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec id="sec001">
<title>Motivation</title>
<p>Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.</p>
</sec>
<sec id="sec002">
<title>Results</title>
<p>There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order
<italic>k</italic>
-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The
<italic>aggregation of reads</italic>
is a
<italic>pre-processing</italic>
approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of
<italic>k</italic>
-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.</p>
</sec>
<sec id="sec003">
<title>Availability</title>
<p>An open source, platform-independent implementation of the method in the Julia programming language is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
. A Matlab implementation is available at
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
.</p>
</sec>
</div>
</front>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26496191</article-id>
<article-id pub-id-type="pmc">4619776</article-id>
<article-id pub-id-type="publisher-id">PONE-D-15-15487</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0140644</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition</article-title>
<alt-title alt-title-type="running-head">ARK</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Koslicki</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chatterjee</surname>
<given-names>Saikat</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shahrivar</surname>
<given-names>Damon</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Walker</surname>
<given-names>Alan W.</given-names>
</name>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Francis</surname>
<given-names>Suzanna C.</given-names>
</name>
<xref ref-type="aff" rid="aff004">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fraser</surname>
<given-names>Louise J.</given-names>
</name>
<xref ref-type="aff" rid="aff005">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vehkaperä</surname>
<given-names>Mikko</given-names>
</name>
<xref ref-type="aff" rid="aff006">
<sup>6</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Lan</surname>
<given-names>Yueheng</given-names>
</name>
<xref ref-type="aff" rid="aff007">
<sup>7</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Corander</surname>
<given-names>Jukka</given-names>
</name>
<xref ref-type="aff" rid="aff008">
<sup>8</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Dept of Mathematics, Oregon State University, Corvallis, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom</addr-line>
</aff>
<aff id="aff004">
<label>4</label>
<addr-line>MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom</addr-line>
</aff>
<aff id="aff005">
<label>5</label>
<addr-line>Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom</addr-line>
</aff>
<aff id="aff006">
<label>6</label>
<addr-line>Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom</addr-line>
</aff>
<aff id="aff007">
<label>7</label>
<addr-line>Dept of Physics, Tsinghua University, Beijing, China</addr-line>
</aff>
<aff id="aff008">
<label>8</label>
<addr-line>Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Badger</surname>
<given-names>Jonathan H.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>National Cancer Institute, UNITED STATES</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>
<bold>Competing Interests: </bold>
L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.</p>
</fn>
<fn fn-type="con" id="contrib001">
<p>Conceived and designed the experiments: SC DK DS. Performed the experiments: DK SC DS. Analyzed the data: SC DK AWW JC. Contributed reagents/materials/analysis tools: AWW SCF LJF JC. Wrote the paper: DK SC AWW MV YL JC. Led the team: SC.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>sach@kth.se</email>
</corresp>
</author-notes>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>23</day>
<month>10</month>
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>10</issue>
<elocation-id>e0140644</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>4</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>9</month>
<year>2015</year>
</date>
</history>
<permissions>
<license xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0140644.pdf"></self-uri>
<abstract>
<sec id="sec001">
<title>Motivation</title>
<p>Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.</p>
</sec>
<sec id="sec002">
<title>Results</title>
<p>There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order
<italic>k</italic>
-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The
<italic>aggregation of reads</italic>
is a
<italic>pre-processing</italic>
approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of
<italic>k</italic>
-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.</p>
</sec>
<sec id="sec003">
<title>Availability</title>
<p>An open source, platform-independent implementation of the method in the Julia programming language is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
. A Matlab implementation is available at
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
.</p>
</sec>
</abstract>
<funding-group>
<funding-statement>This work was supported by the Swedish Research Council Linnaeus Centre ACCESS (S.C.), ERC grant 239784 (J.C.), the Academy of Finland Center of Excellence COIN (J.C.), the Academy of Finland (M.V.), the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS) (A.W.W), and the UK MRC/DFID grant G1002369 (S.C.F). L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<fig-count count="10"></fig-count>
<table-count count="0"></table-count>
<page-count count="16"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>The data and programs are available from GitHub (
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
) or from here (
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
). Further real biological data that we used for experiment have been submitted to the European Nucleotide Archive using the accession number PRJEB9828.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>The data and programs are available from GitHub (
<ext-link ext-link-type="uri" xlink:href="https://github.com/dkoslicki/ARK">https://github.com/dkoslicki/ARK</ext-link>
) or from here (
<ext-link ext-link-type="uri" xlink:href="http://www.ee.kth.se/ctsoftware">http://www.ee.kth.se/ctsoftware</ext-link>
). Further real biological data that we used for experiment have been submitted to the European Nucleotide Archive using the accession number PRJEB9828.</p>
</notes>
</front>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001014 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 001014 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:4619776
   |texte=   ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:26496191" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021