Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.
Identifieur interne : 000C62 ( PubMed/Checkpoint ); précédent : 000C61; suivant : 000C63Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.
Auteurs : David Pellow [Israël] ; Darya Filippova [États-Unis] ; Carl Kingsford [États-Unis]Source :
- Journal of computational biology : a journal of computational molecular cell biology [ 1557-8666 ] ; 2017.
Descripteurs français
- KwdFr :
- MESH :
English descriptors
- KwdEn :
- MESH :
Abstract
Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
DOI: 10.1089/cmb.2016.0155
PubMed: 27828710
Affiliations:
Links toward previous steps (curation, corpus...)
Links to Exploration step
pubmed:27828710Le document en format XML
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</title>
<author><name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1"><nlm:affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</nlm:affiliation>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv</wicri:regionArea>
<wicri:noRegion>Tel Aviv</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation wicri:level="2"><nlm:affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</nlm:affiliation>
<country>États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>2 Roche Sequencing Solutions , Pleasanton</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="2"><nlm:affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh</wicri:cityArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2017">2017</date>
<idno type="RBID">pubmed:27828710</idno>
<idno type="pmid">27828710</idno>
<idno type="doi">10.1089/cmb.2016.0155</idno>
<idno type="wicri:Area/PubMed/Corpus">000E99</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000E99</idno>
<idno type="wicri:Area/PubMed/Curation">000E99</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000E99</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000C62</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000C62</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</title>
<author><name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1"><nlm:affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</nlm:affiliation>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv</wicri:regionArea>
<wicri:noRegion>Tel Aviv</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation wicri:level="2"><nlm:affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</nlm:affiliation>
<country>États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>2 Roche Sequencing Solutions , Pleasanton</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="2"><nlm:affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh</wicri:cityArea>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of computational biology : a journal of computational molecular cell biology</title>
<idno type="eISSN">1557-8666</idno>
<imprint><date when="2017" type="published">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Computer Simulation</term>
<term>Humans</term>
<term>Probability</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Biologie informatique ()</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Probabilité</term>
<term>Simulation numérique</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Computer Simulation</term>
<term>Humans</term>
<term>Probability</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Probabilité</term>
<term>Simulation numérique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</div>
</front>
</TEI>
<pubmed><MedlineCitation Status="MEDLINE" Owner="NLM"><PMID Version="1">27828710</PMID>
<DateCompleted><Year>2017</Year>
<Month>11</Month>
<Day>16</Day>
</DateCompleted>
<DateRevised><Year>2020</Year>
<Month>04</Month>
<Day>15</Day>
</DateRevised>
<Article PubModel="Print-Electronic"><Journal><ISSN IssnType="Electronic">1557-8666</ISSN>
<JournalIssue CitedMedium="Internet"><Volume>24</Volume>
<Issue>6</Issue>
<PubDate><Year>2017</Year>
<Month>Jun</Month>
</PubDate>
</JournalIssue>
<Title>Journal of computational biology : a journal of computational molecular cell biology</Title>
<ISOAbbreviation>J. Comput. Biol.</ISOAbbreviation>
</Journal>
<ArticleTitle>Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</ArticleTitle>
<Pagination><MedlinePgn>547-557</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1089/cmb.2016.0155</ELocationID>
<Abstract><AbstractText>Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y"><Author ValidYN="Y"><LastName>Pellow</LastName>
<ForeName>David</ForeName>
<Initials>D</Initials>
<AffiliationInfo><Affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Filippova</LastName>
<ForeName>Darya</ForeName>
<Initials>D</Initials>
<AffiliationInfo><Affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Kingsford</LastName>
<ForeName>Carl</ForeName>
<Initials>C</Initials>
<AffiliationInfo><Affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList><PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic"><Year>2016</Year>
<Month>11</Month>
<Day>09</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo><Country>United States</Country>
<MedlineTA>J Comput Biol</MedlineTA>
<NlmUniqueID>9433358</NlmUniqueID>
<ISSNLinking>1066-5277</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList><MeshHeading><DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D003198" MajorTopicYN="N">Computer Simulation</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D011336" MajorTopicYN="N">Probability</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM"><Keyword MajorTopicYN="N">Bloom fitters</Keyword>
<Keyword MajorTopicYN="N">efficient data structures</Keyword>
<Keyword MajorTopicYN="N">genomics</Keyword>
<Keyword MajorTopicYN="N">k-mers.</Keyword>
<Keyword MajorTopicYN="N">string algorithms</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData><History><PubMedPubDate PubStatus="pubmed"><Year>2016</Year>
<Month>11</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline"><Year>2017</Year>
<Month>11</Month>
<Day>29</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez"><Year>2016</Year>
<Month>11</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList><ArticleId IdType="pubmed">27828710</ArticleId>
<ArticleId IdType="doi">10.1089/cmb.2016.0155</ArticleId>
<ArticleId IdType="pmc">PMC5467106</ArticleId>
</ArticleIdList>
<ReferenceList><Reference><Citation>Genome Res. 2008 May;18(5):821-9</Citation>
<ArticleIdList><ArticleId IdType="pubmed">18349386</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Bioinformatics. 2010 Jul 1;26(13):1595-600</Citation>
<ArticleIdList><ArticleId IdType="pubmed">20472541</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList><ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7</Citation>
<ArticleIdList><ArticleId IdType="pubmed">22847406</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Algorithms Mol Biol. 2013 Sep 16;8(1):22</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24040893</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Bioinformatics. 2014 May 15;30(10):1354-62</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24451628</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Algorithms Mol Biol. 2014 Feb 24;9(1):2</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24565280</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Genome Biol. 2014 Mar 03;15(3):R46</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24580807</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Nat Biotechnol. 2014 May;32(5):462-4</Citation>
<ArticleIdList><ArticleId IdType="pubmed">24752080</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>BMC Bioinformatics. 2014;15 Suppl 9:S7</Citation>
<ArticleIdList><ArticleId IdType="pubmed">25252952</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Genome Biol. 2014;15(11):509</Citation>
<ArticleIdList><ArticleId IdType="pubmed">25398208</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>BMC Bioinformatics. 2015 Sep 14;16:288</Citation>
<ArticleIdList><ArticleId IdType="pubmed">26370285</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Nat Biotechnol. 2016 Mar;34(3):300-2</Citation>
<ArticleIdList><ArticleId IdType="pubmed">26854477</ArticleId>
</ArticleIdList>
</Reference>
<Reference><Citation>Res Comput Mol Biol. 2014 Apr;8394:385-399</Citation>
<ArticleIdList><ArticleId IdType="pubmed">28825060</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
<affiliations><list><country><li>Israël</li>
<li>États-Unis</li>
</country>
<region><li>Californie</li>
<li>Pennsylvanie</li>
</region>
</list>
<tree><country name="Israël"><noRegion><name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
</noRegion>
</country>
<country name="États-Unis"><region name="Californie"><name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
</region>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
</country>
</tree>
</affiliations>
</record>
Pour manipuler ce document sous Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C62 | SxmlIndent | more
Ou
HfdSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd -nk 000C62 | SxmlIndent | more
Pour mettre un lien sur cette page dans le réseau Wicri
{{Explor lien |wiki= Sante |area= MersV1 |flux= PubMed |étape= Checkpoint |type= RBID |clé= pubmed:27828710 |texte= Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters. }}
Pour générer des pages wiki
HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/RBID.i -Sk "pubmed:27828710" \ | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd \ | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33. |