Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
Identifieur interne : 000D98 ( Pmc/Curation ); précédent : 000D97; suivant : 000D99Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
Auteurs : David Pellow ; Darya Filippova ; Carl KingsfordSource :
- Journal of Computational Biology [ 1066-5277 ] ; 2017.
Abstract
Url:
DOI: 10.1089/cmb.2016.0155
PubMed: 27828710
PubMed Central: 5467106
Links toward previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: Pour aller vers cette notice dans l'étape Curation :000D98
Links to Exploration step
PMC:5467106Le document en format XML
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using <italic>k</italic>
-mer Bloom Filters</title>
<author><name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation><nlm:aff id="aff2"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation><nlm:aff id="aff3"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">27828710</idno>
<idno type="pmc">5467106</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106</idno>
<idno type="RBID">PMC:5467106</idno>
<idno type="doi">10.1089/cmb.2016.0155</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000D98</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D98</idno>
<idno type="wicri:Area/Pmc/Curation">000D98</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000D98</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Improving Bloom Filter Performance on Sequence Data Using <italic>k</italic>
-mer Bloom Filters</title>
<author><name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation><nlm:aff id="aff2"></nlm:aff>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation><nlm:aff id="aff3"></nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of Computational Biology</title>
<idno type="ISSN">1066-5277</idno>
<idno type="eISSN">1557-8666</idno>
<imprint><date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><title>Abstract</title>
<p><bold>Using a sequence's <italic>k</italic>
-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As <italic>k</italic>
-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for <italic>k</italic>
-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of <italic>k</italic>
-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because <italic>k</italic>
-mers are derived from sequencing reads, the information about <italic>k</italic>
-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage <italic>k</italic>
-mer overlap information to store <italic>k</italic>
-mer sets in about half the space while maintaining the original FPR. We consider several variants of such <italic>k</italic>
-mer Bloom filters (<italic>k</italic>
BFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</bold>
</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Pellow, D" uniqKey="Pellow D">D. Pellow</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Benoit, G" uniqKey="Benoit G">G. Benoit</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bloom, B" uniqKey="Bloom B">B. Bloom</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Broder, A" uniqKey="Broder A">A. Broder</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Heo, Y" uniqKey="Heo Y">Y. Heo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Holley, G" uniqKey="Holley G">G. Holley</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Malde, K" uniqKey="Malde K">K. Malde</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Patro, R" uniqKey="Patro R">R. Patro</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pell, J" uniqKey="Pell J">J. Pell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Pellow, D" uniqKey="Pellow D">D. Pellow</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rozov, R" uniqKey="Rozov R">R. Rozov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Salikhov, K" uniqKey="Salikhov K">K. Salikhov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Shi, H" uniqKey="Shi H">H. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Solomon, B" uniqKey="Solomon B">B. Solomon</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Song, L" uniqKey="Song L">L. Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Stranneheim, H" uniqKey="Stranneheim H">H. Stranneheim</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wood, D" uniqKey="Wood D">D. Wood</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yu, Y" uniqKey="Yu Y">Y. Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zerbino, D" uniqKey="Zerbino D">D. Zerbino</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
<front><journal-meta><journal-id journal-id-type="nlm-ta">J Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">J. Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">cmb</journal-id>
<journal-title-group><journal-title>Journal of Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1066-5277</issn>
<issn pub-type="epub">1557-8666</issn>
<publisher><publisher-name>Mary Ann Liebert, Inc.</publisher-name>
<publisher-loc>140 Huguenot Street, 3rd FloorNew Rochelle, NY 10801USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">27828710</article-id>
<article-id pub-id-type="pmc">5467106</article-id>
<article-id pub-id-type="publisher-id">10.1089/cmb.2016.0155</article-id>
<article-id pub-id-type="doi">10.1089/cmb.2016.0155</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Articles</subject>
</subj-group>
</article-categories>
<title-group><article-title>Improving Bloom Filter Performance on Sequence Data Using <italic>k</italic>
-mer Bloom Filters</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Pellow</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="aff1"><sup>1,</sup>
</xref>
<xref ref-type="author-notes" rid="fn1"><sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Filippova</surname>
<given-names>Darya</given-names>
</name>
<xref ref-type="aff" rid="aff2"><sup>2,</sup>
</xref>
<xref ref-type="author-notes" rid="fn2"><sup>†</sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes"><name><surname>Kingsford</surname>
<given-names>Carl</given-names>
</name>
<xref ref-type="aff" rid="aff3"><sup>3</sup>
</xref>
</contrib>
<aff id="aff1"><label><sup>1</sup>
</label>
The Blavatnik School of Computer Science,<institution>Tel Aviv University</institution>
, Tel Aviv,<country>Israel</country>
.</aff>
<aff id="aff2"><label><sup>2</sup>
</label>
<institution>Roche Sequencing Solutions</institution>
, Pleasanton, California.</aff>
<aff id="aff3"><label><sup>3</sup>
</label>
Computational Biology Department, School of Computer Science,<institution>Carnegie Mellon University</institution>
, Pittsburgh, Pennsylvania.</aff>
</contrib-group>
<author-notes><fn id="fn1" fn-type="other"><label><sup>*</sup>
</label>
<p>Work primarily completed at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.</p>
</fn>
<fn id="fn2" fn-type="other"><label><sup>†</sup>
</label>
<p>Work primarily completed at the Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.</p>
</fn>
<corresp><addr-line>Address correspondence to:</addr-line>
<addr-line><italic>Carl Kingsford, Associate Professor</italic>
</addr-line>
<addr-line><italic>Computational Biology Department</italic>
</addr-line>
<addr-line><italic>School of Computer Science</italic>
</addr-line>
<institution><italic>Carnegie Mellon University</italic>
</institution>
<addr-line><italic>5000 Forbes Avenue</italic>
</addr-line>
<addr-line><italic>Pittsburgh, PA 15213</italic>
</addr-line>
<italic>E-mail:</italic>
<email xlink:href="mailto:carlk@cs.cmu.edu">carlk@cs.cmu.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub"><day>01</day>
<month>6</month>
<year>2017</year>
<pmc-comment>string-date: June 2017</pmc-comment>
</pub-date>
<pub-date pub-type="epub"><day>01</day>
<month>6</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>01</day>
<month>6</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>24</volume>
<issue>6</issue>
<fpage>547</fpage>
<lpage>557</lpage>
<permissions><copyright-statement>© David Pellow, et al., 2016. Published by Mary Ann Liebert, Inc.</copyright-statement>
<copyright-year>2017</copyright-year>
<license license-type="open-access"><license-p>This Open Access article is distributed under the terms of the Creative Commons License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="cmb.2016.0155.pdf"></self-uri>
<abstract><title>Abstract</title>
<p><bold>Using a sequence's <italic>k</italic>
-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As <italic>k</italic>
-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for <italic>k</italic>
-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of <italic>k</italic>
-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because <italic>k</italic>
-mers are derived from sequencing reads, the information about <italic>k</italic>
-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage <italic>k</italic>
-mer overlap information to store <italic>k</italic>
-mer sets in about half the space while maintaining the original FPR. We consider several variants of such <italic>k</italic>
-mer Bloom filters (<italic>k</italic>
BFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</bold>
</p>
</abstract>
<kwd-group kwd-group-type="author"><title><bold>Keywords:</bold>
</title>
<kwd>efficient data structures</kwd>
<kwd>genomics</kwd>
<kwd>string algorithms</kwd>
<kwd>Bloom fitters</kwd>
<kwd><italic>k</italic>
-mers.</kwd>
</kwd-group>
<counts><fig-count count="2"></fig-count>
<table-count count="8"></table-count>
<ref-count count="21"></ref-count>
<page-count count="11"></page-count>
</counts>
</article-meta>
</front>
</pmc>
</record>
Pour manipuler ce document sous Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D98 | SxmlIndent | more
Ou
HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000D98 | SxmlIndent | more
Pour mettre un lien sur cette page dans le réseau Wicri
{{Explor lien |wiki= Sante |area= MersV1 |flux= Pmc |étape= Curation |type= RBID |clé= PMC:5467106 |texte= Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters }}
Pour générer des pages wiki
HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i -Sk "pubmed:27828710" \ | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd \ | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33. |