MERS Exploration Server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

Streaming histogram sketching for rapid microbiome analytics

Internal identifier: 000398 (Pmc/Curation); previous: 000397; next: 000399

Authors: Will Pm Rowe [United Kingdom]; Anna Paola Carrieri [United Kingdom]; Cristina Alcon-Giner [United Kingdom]; Shabhonam Caim [United Kingdom]; Alex Shaw [United Kingdom]; Kathleen Sim [United Kingdom]; J. Simon Kroll [United Kingdom]; Lindsay J. Hall [United Kingdom]; Edward O. Pyzer-Knapp [United Kingdom]; Martyn D. Winn [United Kingdom]

Source:

RBID : PMC:6420756

Abstract

Background

The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time.

To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.
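
To make the idea of a streaming k-mer spectrum concrete, the following is a minimal sketch and not the HULK implementation. It assumes that reads arrive one at a time and that each k-mer is hashed into one of a fixed number of histogram bins; the function name `kmer_spectrum`, the CRC32 hash and the bin count are illustrative choices that do not come from the paper.

```python
import zlib

def kmer_spectrum(reads, k=7, num_bins=16):
    """Stream reads and count their k-mers into a fixed-size histogram.

    Hypothetical helper: the hash function and bin count are assumptions
    made for illustration only.
    """
    counts = [0] * num_bins
    valid = set("ACGT")
    for read in reads:              # one pass over the stream of reads
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:  # skip k-mers with ambiguous bases
                bin_index = zlib.crc32(kmer.encode()) % num_bins
                counts[bin_index] += 1
    return counts

# Two toy reads of length 10 give 7 k-mers each when k=4.
spectrum = kmer_spectrum(["ACGTACGTAC", "TTGACGTACG"], k=4, num_bins=8)
print(spectrum, sum(spectrum))
```

Because the histogram has a fixed size, memory use does not grow with the number of reads, which is the property that makes a streaming representation attractive here.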

Results

We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme.
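
As a rough illustration of the similarity being estimated, the sketch below computes two quantities on toy data. It assumes that the target measure is the weighted (min-max) Jaccard similarity between two k-mer histograms, and that a sketch-based estimate can be taken as the fraction of sketch positions on which two equal-length sketches agree. Both function names and all input values are hypothetical.

```python
def weighted_jaccard(hist_a, hist_b):
    """Exact weighted Jaccard: sum of bin-wise minima over sum of maxima."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    numerator = sum(min(x, y) for x, y in zip(hist_a, hist_b))
    denominator = sum(max(x, y) for x, y in zip(hist_a, hist_b))
    return numerator / denominator if denominator else 1.0

def sketch_similarity(sketch_a, sketch_b):
    """Estimate similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)

# Toy histograms: minima sum to 4, maxima sum to 8.
exact = weighted_jaccard([4, 0, 2, 1], [2, 1, 2, 0])
# Toy sketches agreeing on 3 of 4 positions.
estimate = sketch_similarity([3, 7, 7, 1], [3, 2, 7, 1])
print(exact, estimate)
```

Comparing fixed-length sketches position by position costs time proportional to the sketch length rather than to the size of the underlying sample, which is what makes pairwise comparison and indexing fast.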

Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.
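
For readers less familiar with the two reported metrics, here is a small worked example of how accuracy and precision are defined for a binary classifier. The labels are made up for illustration (1 standing for "received antibiotics", 0 for "did not") and have no connection to the study data; the function name `accuracy_and_precision` is likewise hypothetical.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Precision is undefined when nothing is predicted positive; use 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Toy labels: 6 of 8 predictions are correct, 3 of 4 positive calls are right.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(accuracy, precision)
```

Accuracy counts every correct prediction, whereas precision only asks how many of the samples flagged as positive truly were positive, so the two can differ when the classes are imbalanced.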

Conclusions

Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).


Url:
DOI: 10.1186/s40168-019-0653-2
PubMed: 30878035
PubMed Central: 6420756

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Streaming histogram sketching for rapid microbiome analytics</title>
<author>
<name sortKey="Rowe, Will Pm" sort="Rowe, Will Pm" uniqKey="Rowe W" first="Will Pm" last="Rowe">Will Pm Rowe</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>Scientific Computing Department,</institution>
<institution>STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Carrieri, Anna Paola" sort="Carrieri, Anna Paola" uniqKey="Carrieri A" first="Anna Paola" last="Carrieri">Anna Paola Carrieri</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.14467.30</institution-id>
<institution>IBM Research,</institution>
<institution>The Hartree Centre,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Alcon Giner, Cristina" sort="Alcon Giner, Cristina" uniqKey="Alcon Giner C" first="Cristina" last="Alcon-Giner">Cristina Alcon-Giner</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Caim, Shabhonam" sort="Caim, Shabhonam" uniqKey="Caim S" first="Shabhonam" last="Caim">Shabhonam Caim</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shaw, Alex" sort="Shaw, Alex" uniqKey="Shaw A" first="Alex" last="Shaw">Alex Shaw</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Sim, Kathleen" sort="Sim, Kathleen" uniqKey="Sim K" first="Kathleen" last="Sim">Kathleen Sim</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kroll, J Simon" sort="Kroll, J Simon" uniqKey="Kroll J" first="J. Simon" last="Kroll">J. Simon Kroll</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Hall, Lindsay J" sort="Hall, Lindsay J" uniqKey="Hall L" first="Lindsay J." last="Hall">Lindsay J. Hall</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pyzer Knapp, Edward O" sort="Pyzer Knapp, Edward O" uniqKey="Pyzer Knapp E" first="Edward O." last="Pyzer-Knapp">Edward O. Pyzer-Knapp</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.14467.30</institution-id>
<institution>IBM Research,</institution>
<institution>The Hartree Centre,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Winn, Martyn D" sort="Winn, Martyn D" uniqKey="Winn M" first="Martyn D." last="Winn">Martyn D. Winn</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>Scientific Computing Department,</institution>
<institution>STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30878035</idno>
<idno type="pmc">6420756</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6420756</idno>
<idno type="RBID">PMC:6420756</idno>
<idno type="doi">10.1186/s40168-019-0653-2</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000398</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000398</idno>
<idno type="wicri:Area/Pmc/Curation">000398</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000398</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Streaming histogram sketching for rapid microbiome analytics</title>
<author>
<name sortKey="Rowe, Will Pm" sort="Rowe, Will Pm" uniqKey="Rowe W" first="Will Pm" last="Rowe">Will Pm Rowe</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>Scientific Computing Department,</institution>
<institution>STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Carrieri, Anna Paola" sort="Carrieri, Anna Paola" uniqKey="Carrieri A" first="Anna Paola" last="Carrieri">Anna Paola Carrieri</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.14467.30</institution-id>
<institution>IBM Research,</institution>
<institution>The Hartree Centre,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Alcon Giner, Cristina" sort="Alcon Giner, Cristina" uniqKey="Alcon Giner C" first="Cristina" last="Alcon-Giner">Cristina Alcon-Giner</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Caim, Shabhonam" sort="Caim, Shabhonam" uniqKey="Caim S" first="Shabhonam" last="Caim">Shabhonam Caim</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Shaw, Alex" sort="Shaw, Alex" uniqKey="Shaw A" first="Alex" last="Shaw">Alex Shaw</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Sim, Kathleen" sort="Sim, Kathleen" uniqKey="Sim K" first="Kathleen" last="Sim">Kathleen Sim</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Kroll, J Simon" sort="Kroll, J Simon" uniqKey="Kroll J" first="J. Simon" last="Kroll">J. Simon Kroll</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff4">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>London</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Hall, Lindsay J" sort="Hall, Lindsay J" uniqKey="Hall L" first="Lindsay J." last="Hall">Lindsay J. Hall</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Norwich</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pyzer Knapp, Edward O" sort="Pyzer Knapp, Edward O" uniqKey="Pyzer Knapp E" first="Edward O." last="Pyzer-Knapp">Edward O. Pyzer-Knapp</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.14467.30</institution-id>
<institution>IBM Research,</institution>
<institution>The Hartree Centre,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Winn, Martyn D" sort="Winn, Martyn D" uniqKey="Winn M" first="Martyn D." last="Winn">Martyn D. Winn</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>Scientific Computing Department,</institution>
<institution>STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Warrington</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Microbiome</title>
<idno type="eISSN">2049-2618</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time.</p>
<p id="Par2">To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par3">We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme.</p>
<p id="Par4">Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par5">Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (
<ext-link ext-link-type="uri" xlink:href="https://github.com/will-rowe/hulk">https://github.com/will-rowe/hulk</ext-link>
).</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thompson, Lr" uniqKey="Thompson L">LR Thompson</name>
</author>
<author>
<name sortKey="Sanders, Jg" uniqKey="Sanders J">JG Sanders</name>
</author>
<author>
<name sortKey="Mcdonald, D" uniqKey="Mcdonald D">D McDonald</name>
</author>
<author>
<name sortKey="Amir, A" uniqKey="Amir A">A Amir</name>
</author>
<author>
<name sortKey="Ladau, J" uniqKey="Ladau J">J Ladau</name>
</author>
<author>
<name sortKey="Locey, Kj" uniqKey="Locey K">KJ Locey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rusch, Db" uniqKey="Rusch D">DB Rusch</name>
</author>
<author>
<name sortKey="Halpern, Al" uniqKey="Halpern A">AL Halpern</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
<author>
<name sortKey="Heidelberg, Kb" uniqKey="Heidelberg K">KB Heidelberg</name>
</author>
<author>
<name sortKey="Williamson, S" uniqKey="Williamson S">S Williamson</name>
</author>
<author>
<name sortKey="Yooseph, S" uniqKey="Yooseph S">S Yooseph</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mulcahy O Rady, H" uniqKey="Mulcahy O Rady H">H Mulcahy-O’Grady</name>
</author>
<author>
<name sortKey="Workentine, Ml" uniqKey="Workentine M">ML Workentine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Forbes, Jd" uniqKey="Forbes J">JD Forbes</name>
</author>
<author>
<name sortKey="Knox, Nc" uniqKey="Knox N">NC Knox</name>
</author>
<author>
<name sortKey="Peterson, C L" uniqKey="Peterson C">C-L Peterson</name>
</author>
<author>
<name sortKey="Reimer, Ar" uniqKey="Reimer A">AR Reimer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Greninger, Al" uniqKey="Greninger A">AL Greninger</name>
</author>
<author>
<name sortKey="Naccache, Sn" uniqKey="Naccache S">SN Naccache</name>
</author>
<author>
<name sortKey="Federman, S" uniqKey="Federman S">S Federman</name>
</author>
<author>
<name sortKey="Yu, G" uniqKey="Yu G">G Yu</name>
</author>
<author>
<name sortKey="Mbala, P" uniqKey="Mbala P">P Mbala</name>
</author>
<author>
<name sortKey="Bres, V" uniqKey="Bres V">V Bres</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kakkanatt, C" uniqKey="Kakkanatt C">C Kakkanatt</name>
</author>
<author>
<name sortKey="Benigno, M" uniqKey="Benigno M">M Benigno</name>
</author>
<author>
<name sortKey="Jackson, Vm" uniqKey="Jackson V">VM Jackson</name>
</author>
<author>
<name sortKey="Huang, Pl" uniqKey="Huang P">PL Huang</name>
</author>
<author>
<name sortKey="Ng, K" uniqKey="Ng K">K Ng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morgan, Xc" uniqKey="Morgan X">XC Morgan</name>
</author>
<author>
<name sortKey="Huttenhower, C" uniqKey="Huttenhower C">C Huttenhower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dubinkina, Vb" uniqKey="Dubinkina V">VB Dubinkina</name>
</author>
<author>
<name sortKey="Ischenko, Ds" uniqKey="Ischenko D">DS Ischenko</name>
</author>
<author>
<name sortKey="Ulyantsev, Vi" uniqKey="Ulyantsev V">VI Ulyantsev</name>
</author>
<author>
<name sortKey="Tyakht, Av" uniqKey="Tyakht A">AV Tyakht</name>
</author>
<author>
<name sortKey="Alexeev, Dg" uniqKey="Alexeev D">DG Alexeev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benoit, G" uniqKey="Benoit G">G Benoit</name>
</author>
<author>
<name sortKey="Peterlongo, P" uniqKey="Peterlongo P">P Peterlongo</name>
</author>
<author>
<name sortKey="Mariadassou, M" uniqKey="Mariadassou M">M Mariadassou</name>
</author>
<author>
<name sortKey="Drezen, E" uniqKey="Drezen E">E Drezen</name>
</author>
<author>
<name sortKey="Schbath, S" uniqKey="Schbath S">S Schbath</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anvar, Sy" uniqKey="Anvar S">SY Anvar</name>
</author>
<author>
<name sortKey="Khachatryan, L" uniqKey="Khachatryan L">L Khachatryan</name>
</author>
<author>
<name sortKey="Vermaat, M" uniqKey="Vermaat M">M Vermaat</name>
</author>
<author>
<name sortKey="Van Galen, M" uniqKey="Van Galen M">M van Galen</name>
</author>
<author>
<name sortKey="Pulyakhina, I" uniqKey="Pulyakhina I">I Pulyakhina</name>
</author>
<author>
<name sortKey="Ariyurek, Y" uniqKey="Ariyurek Y">Y Ariyurek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Libbrecht, Mw" uniqKey="Libbrecht M">MW Libbrecht</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Seth, S" uniqKey="Seth S">S Seth</name>
</author>
<author>
<name sortKey="V Lim Ki, N" uniqKey="V Lim Ki N">N Välimäki</name>
</author>
<author>
<name sortKey="Kaski, S" uniqKey="Kaski S">S Kaski</name>
</author>
<author>
<name sortKey="Honkela, A" uniqKey="Honkela A">A Honkela</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ondov, Bd" uniqKey="Ondov B">BD Ondov</name>
</author>
<author>
<name sortKey="Treangen, Tj" uniqKey="Treangen T">TJ Treangen</name>
</author>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author>
<name sortKey="Mallonee, Ab" uniqKey="Mallonee A">AB Mallonee</name>
</author>
<author>
<name sortKey="Bergman, Nh" uniqKey="Bergman N">NH Bergman</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brown, T" uniqKey="Brown T">T Brown</name>
</author>
<author>
<name sortKey="Irber, L" uniqKey="Irber L">L Irber</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bovee, R" uniqKey="Bovee R">R Bovee</name>
</author>
<author>
<name sortKey="Greenfield, N" uniqKey="Greenfield N">N Greenfield</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, W" uniqKey="Wu W">W Wu</name>
</author>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Chen, L" uniqKey="Chen L">L Chen</name>
</author>
<author>
<name sortKey="Zhang, C" uniqKey="Zhang C">C Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ioffe, S" uniqKey="Ioffe S">S Ioffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, D" uniqKey="Yang D">D Yang</name>
</author>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
<author>
<name sortKey="Rettig, L" uniqKey="Rettig L">L Rettig</name>
</author>
<author>
<name sortKey="Cudre Mauroux, P" uniqKey="Cudre Mauroux P">P Cudré-Mauroux</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, Ac" uniqKey="Howe A">AC Howe</name>
</author>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haveliwala, T" uniqKey="Haveliwala T">T Haveliwala</name>
</author>
<author>
<name sortKey="Gionis, A" uniqKey="Gionis A">A Gionis</name>
</author>
<author>
<name sortKey="Indyk, P" uniqKey="Indyk P">P Indyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormode, G" uniqKey="Cormode G">G Cormode</name>
</author>
<author>
<name sortKey="Muthukrishnan, S" uniqKey="Muthukrishnan S">S Muthukrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koychev, I" uniqKey="Koychev I">I Koychev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gruning, B" uniqKey="Gruning B">B Grüning</name>
</author>
<author>
<name sortKey="Dale, R" uniqKey="Dale R">R Dale</name>
</author>
<author>
<name sortKey="Sjodin, A" uniqKey="Sjodin A">A Sjödin</name>
</author>
<author>
<name sortKey="Chapman, Ba" uniqKey="Chapman B">BA Chapman</name>
</author>
<author>
<name sortKey="Rowe, J" uniqKey="Rowe J">J Rowe</name>
</author>
<author>
<name sortKey="Tomkins Tinch, Ch" uniqKey="Tomkins Tinch C">CH Tomkins-Tinch</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bawa, M" uniqKey="Bawa M">M Bawa</name>
</author>
<author>
<name sortKey="Condie, T" uniqKey="Condie T">T Condie</name>
</author>
<author>
<name sortKey="Ganesan, P" uniqKey="Ganesan P">P Ganesan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pedregosa, F" uniqKey="Pedregosa F">F Pedregosa</name>
</author>
<author>
<name sortKey="Varoquaux, G" uniqKey="Varoquaux G">G Varoquaux</name>
</author>
<author>
<name sortKey="Gramfort, A" uniqKey="Gramfort A">A Gramfort</name>
</author>
<author>
<name sortKey="Michel, V" uniqKey="Michel V">V Michel</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author>
<name sortKey="Grisel, O" uniqKey="Grisel O">O Grisel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sczyrba, A" uniqKey="Sczyrba A">A Sczyrba</name>
</author>
<author>
<name sortKey="Hofmann, P" uniqKey="Hofmann P">P Hofmann</name>
</author>
<author>
<name sortKey="Belmann, P" uniqKey="Belmann P">P Belmann</name>
</author>
<author>
<name sortKey="Koslicki, D" uniqKey="Koslicki D">D Koslicki</name>
</author>
<author>
<name sortKey="Janssen, S" uniqKey="Janssen S">S Janssen</name>
</author>
<author>
<name sortKey="Droge, J" uniqKey="Droge J">J Dröge</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Coelho, Lp" uniqKey="Coelho L">LP Coelho</name>
</author>
<author>
<name sortKey="Kultima, Jr" uniqKey="Kultima J">JR Kultima</name>
</author>
<author>
<name sortKey="Costea, Pi" uniqKey="Costea P">PI Costea</name>
</author>
<author>
<name sortKey="Fournier, C" uniqKey="Fournier C">C Fournier</name>
</author>
<author>
<name sortKey="Pan, Y" uniqKey="Pan Y">Y Pan</name>
</author>
<author>
<name sortKey="Czarnecki Maulden, G" uniqKey="Czarnecki Maulden G">G Czarnecki-Maulden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alcon Giner, C" uniqKey="Alcon Giner C">C Alcon-Giner</name>
</author>
<author>
<name sortKey="Caim, S" uniqKey="Caim S">S Caim</name>
</author>
<author>
<name sortKey="Mitra, S" uniqKey="Mitra S">S Mitra</name>
</author>
<author>
<name sortKey="Ketskemety, J" uniqKey="Ketskemety J">J Ketskemety</name>
</author>
<author>
<name sortKey="Wegmann, U" uniqKey="Wegmann U">U Wegmann</name>
</author>
<author>
<name sortKey="Wain, J" uniqKey="Wain J">J Wain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sim, K" uniqKey="Sim K">K Sim</name>
</author>
<author>
<name sortKey="Shaw, Ag" uniqKey="Shaw A">AG Shaw</name>
</author>
<author>
<name sortKey="Randell, P" uniqKey="Randell P">P Randell</name>
</author>
<author>
<name sortKey="Cox, Mj" uniqKey="Cox M">MJ Cox</name>
</author>
<author>
<name sortKey="Mcclure, Ze" uniqKey="Mcclure Z">ZE McClure</name>
</author>
<author>
<name sortKey="Li, M S" uniqKey="Li M">M-S Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shaw, Ag" uniqKey="Shaw A">AG Shaw</name>
</author>
<author>
<name sortKey="Sim, K" uniqKey="Sim K">K Sim</name>
</author>
<author>
<name sortKey="Randell, P" uniqKey="Randell P">P Randell</name>
</author>
<author>
<name sortKey="Cox, Mj" uniqKey="Cox M">MJ Cox</name>
</author>
<author>
<name sortKey="Mcclure, Ze" uniqKey="Mcclure Z">ZE McClure</name>
</author>
<author>
<name sortKey="Li, M S" uniqKey="Li M">M-S Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Carrieri, Ap" uniqKey="Carrieri A">AP Carrieri</name>
</author>
<author>
<name sortKey="Rowe, Wpm" uniqKey="Rowe W">WPM Rowe</name>
</author>
<author>
<name sortKey="Winn, Md" uniqKey="Winn M">MD Winn</name>
</author>
<author>
<name sortKey="Pyzer Knapp, Eo" uniqKey="Pyzer Knapp E">EO Pyzer-Knapp</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Microbiome</journal-id>
<journal-id journal-id-type="iso-abbrev">Microbiome</journal-id>
<journal-title-group>
<journal-title>Microbiome</journal-title>
</journal-title-group>
<issn pub-type="epub">2049-2618</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30878035</article-id>
<article-id pub-id-type="pmc">6420756</article-id>
<article-id pub-id-type="publisher-id">653</article-id>
<article-id pub-id-type="doi">10.1186/s40168-019-0653-2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Streaming histogram sketching for rapid microbiome analytics</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" equal-contrib="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-0384-4463</contrib-id>
<name>
<surname>Rowe</surname>
<given-names>Will PM</given-names>
</name>
<address>
<email>will.rowe@stfc.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Carrieri</surname>
<given-names>Anna Paola</given-names>
</name>
<address>
<email>acarrieri@uk.ibm.com</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Alcon-Giner</surname>
<given-names>Cristina</given-names>
</name>
<address>
<email>Cristina.Alcon@quadram.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Caim</surname>
<given-names>Shabhonam</given-names>
</name>
<address>
<email>Shabhonam.Caim@quadram.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shaw</surname>
<given-names>Alex</given-names>
</name>
<address>
<email>a.shaw@imperial.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sim</surname>
<given-names>Kathleen</given-names>
</name>
<address>
<email>k.sim@imperial.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kroll</surname>
<given-names>J. Simon</given-names>
</name>
<address>
<email>s.kroll@imperial.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff4">4</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Hall</surname>
<given-names>Lindsay J.</given-names>
</name>
<address>
<email>Lindsay.Hall@quadram.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pyzer-Knapp</surname>
<given-names>Edward O.</given-names>
</name>
<address>
<email>EPyzerK3@uk.ibm.com</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Winn</surname>
<given-names>Martyn D.</given-names>
</name>
<address>
<email>martyn.winn@stfc.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0727 2226</institution-id>
<institution-id institution-id-type="GRID">grid.482271.a</institution-id>
<institution>Scientific Computing Department,</institution>
<institution>STFC Daresbury Laboratory,</institution>
</institution-wrap>
Warrington, UK</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="GRID">grid.14467.30</institution-id>
<institution>IBM Research,</institution>
<institution>The Hartree Centre,</institution>
</institution-wrap>
Warrington, UK</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="GRID">grid.420132.6</institution-id>
<institution>Quadram Institute Bioscience,</institution>
<institution>Norwich Research Park,</institution>
</institution-wrap>
Norwich, UK</aff>
<aff id="Aff4">
<label>4</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 2113 8111</institution-id>
<institution-id institution-id-type="GRID">grid.7445.2</institution-id>
<institution>Department of Medicine, Section of Paediatrics,</institution>
<institution>Imperial College London,</institution>
</institution-wrap>
London, UK</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>16</day>
<month>3</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>16</day>
<month>3</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>7</volume>
<elocation-id>40</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>9</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>1</day>
<month>3</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s). 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time.</p>
<p id="Par2">To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par3">We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality-sensitive hashing indexing scheme.</p>
<p id="Par4">Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par5">Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (
<ext-link ext-link-type="uri" xlink:href="https://github.com/will-rowe/hulk">https://github.com/will-rowe/hulk</ext-link>
).</p>
</sec>
</abstract>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000271</institution-id>
<institution>Science and Technology Facilities Council</institution>
</institution-wrap>
</funding-source>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/100004440</institution-id>
<institution>Wellcome Trust</institution>
</institution-wrap>
</funding-source>
<award-id>100/974/C/13/Z</award-id>
<principal-award-recipient>
<name>
<surname>Hall</surname>
<given-names>Lindsay J.</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100000268</institution-id>
<institution>Biotechnology and Biological Sciences Research Council</institution>
</institution-wrap>
</funding-source>
<award-id>BB/M011216/1</award-id>
<award-id>BB/R012490/1</award-id>
<principal-award-recipient>
<name>
<surname>Alcon-Giner</surname>
<given-names>Cristina</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>Lindsay J.</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>Winnicott Foundation</institution>
</funding-source>
<award-id>None</award-id>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000398 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000398 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:6420756
   |texte=   Streaming histogram sketching for rapid microbiome analytics
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:30878035" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021