Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Identifieur interne : 000D98 ( Pmc/Curation ); précédent : 000D97; suivant : 000D99

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Auteurs : David Pellow ; Darya Filippova ; Carl Kingsford

Source :

RBID : PMC:5467106

Abstract

Abstract

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.


Url:
DOI: 10.1089/cmb.2016.0155
PubMed: 27828710
PubMed Central: 5467106

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:5467106

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using
<italic>k</italic>
-mer Bloom Filters</title>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation>
<nlm:aff id="aff2"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="aff3"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27828710</idno>
<idno type="pmc">5467106</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106</idno>
<idno type="RBID">PMC:5467106</idno>
<idno type="doi">10.1089/cmb.2016.0155</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000D98</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D98</idno>
<idno type="wicri:Area/Pmc/Curation">000D98</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000D98</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Improving Bloom Filter Performance on Sequence Data Using
<italic>k</italic>
-mer Bloom Filters</title>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation>
<nlm:aff id="aff2"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation>
<nlm:aff id="aff3"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Journal of Computational Biology</title>
<idno type="ISSN">1066-5277</idno>
<idno type="eISSN">1557-8666</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<p>
<bold>Using a sequence's
<italic>k</italic>
-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As
<italic>k</italic>
-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for
<italic>k</italic>
-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of
<italic>k</italic>
-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because
<italic>k</italic>
-mers are derived from sequencing reads, the information about
<italic>k</italic>
-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage
<italic>k</italic>
-mer overlap information to store
<italic>k</italic>
-mer sets in about half the space while maintaining the original FPR. We consider several variants of such
<italic>k</italic>
-mer Bloom filters (
<italic>k</italic>
BFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</bold>
</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Pellow, D" uniqKey="Pellow D">D. Pellow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benoit, G" uniqKey="Benoit G">G. Benoit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloom, B" uniqKey="Bloom B">B. Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Broder, A" uniqKey="Broder A">A. Broder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heo, Y" uniqKey="Heo Y">Y. Heo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holley, G" uniqKey="Holley G">G. Holley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Malde, K" uniqKey="Malde K">K. Malde</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patro, R" uniqKey="Patro R">R. Patro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J. Pell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pellow, D" uniqKey="Pellow D">D. Pellow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rozov, R" uniqKey="Rozov R">R. Rozov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salikhov, K" uniqKey="Salikhov K">K. Salikhov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shi, H" uniqKey="Shi H">H. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B. Solomon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, L" uniqKey="Song L">L. Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stranneheim, H" uniqKey="Stranneheim H">H. Stranneheim</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, D" uniqKey="Wood D">D. Wood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, Y" uniqKey="Yu Y">Y. Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, D" uniqKey="Zerbino D">D. Zerbino</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">J Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">J. Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">cmb</journal-id>
<journal-title-group>
<journal-title>Journal of Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1066-5277</issn>
<issn pub-type="epub">1557-8666</issn>
<publisher>
<publisher-name>Mary Ann Liebert, Inc.</publisher-name>
<publisher-loc>140 Huguenot Street, 3rd FloorNew Rochelle, NY 10801USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27828710</article-id>
<article-id pub-id-type="pmc">5467106</article-id>
<article-id pub-id-type="publisher-id">10.1089/cmb.2016.0155</article-id>
<article-id pub-id-type="doi">10.1089/cmb.2016.0155</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Improving Bloom Filter Performance on Sequence Data Using
<italic>k</italic>
-mer Bloom Filters</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Pellow</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1,</sup>
</xref>
<xref ref-type="author-notes" rid="fn1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Filippova</surname>
<given-names>Darya</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2,</sup>
</xref>
<xref ref-type="author-notes" rid="fn2">
<sup></sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<aff id="aff1">
<label>
<sup>1</sup>
</label>
The Blavatnik School of Computer Science,
<institution>Tel Aviv University</institution>
, Tel Aviv,
<country>Israel</country>
.</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>
<institution>Roche Sequencing Solutions</institution>
, Pleasanton, California.</aff>
<aff id="aff3">
<label>
<sup>3</sup>
</label>
Computational Biology Department, School of Computer Science,
<institution>Carnegie Mellon University</institution>
, Pittsburgh, Pennsylvania.</aff>
</contrib-group>
<author-notes>
<fn id="fn1" fn-type="other">
<label>
<sup>*</sup>
</label>
<p>Work primarily completed at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.</p>
</fn>
<fn id="fn2" fn-type="other">
<label>
<sup></sup>
</label>
<p>Work primarily completed at the Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.</p>
</fn>
<corresp>
<addr-line>Address correspondence to:</addr-line>
<addr-line>
<italic>Carl Kingsford, Associate Professor</italic>
</addr-line>
<addr-line>
<italic>Computational Biology Department</italic>
</addr-line>
<addr-line>
<italic>School of Computer Science</italic>
</addr-line>
<institution>
<italic>Carnegie Mellon University</italic>
</institution>
<addr-line>
<italic>5000 Forbes Avenue</italic>
</addr-line>
<addr-line>
<italic>Pittsburgh, PA 15213</italic>
</addr-line>
<italic>E-mail:</italic>
<email xlink:href="mailto:carlk@cs.cmu.edu">carlk@cs.cmu.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>6</month>
<year>2017</year>
<pmc-comment>string-date: June 2017</pmc-comment>
</pub-date>
<pub-date pub-type="epub">
<day>01</day>
<month>6</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>01</day>
<month>6</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>24</volume>
<issue>6</issue>
<fpage>547</fpage>
<lpage>557</lpage>
<permissions>
<copyright-statement>© David Pellow, et al., 2016. Published by Mary Ann Liebert, Inc.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="open-access">
<license-p>This Open Access article is distributed under the terms of the Creative Commons License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="cmb.2016.0155.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<p>
<bold>Using a sequence's
<italic>k</italic>
-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As
<italic>k</italic>
-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for
<italic>k</italic>
-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of
<italic>k</italic>
-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because
<italic>k</italic>
-mers are derived from sequencing reads, the information about
<italic>k</italic>
-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage
<italic>k</italic>
-mer overlap information to store
<italic>k</italic>
-mer sets in about half the space while maintaining the original FPR. We consider several variants of such
<italic>k</italic>
-mer Bloom filters (
<italic>k</italic>
BFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</bold>
</p>
</abstract>
<kwd-group kwd-group-type="author">
<title>
<bold>Keywords:</bold>
</title>
<kwd>efficient data structures</kwd>
<kwd>genomics</kwd>
<kwd>string algorithms</kwd>
<kwd>Bloom fitters</kwd>
<kwd>
<italic>k</italic>
-mers.</kwd>
</kwd-group>
<counts>
<fig-count count="2"></fig-count>
<table-count count="8"></table-count>
<ref-count count="21"></ref-count>
<page-count count="11"></page-count>
</counts>
</article-meta>
</front>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D98 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000D98 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:5467106
   |texte=   Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:27828710" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021