Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.

Identifieur interne : 000C62 ( PubMed/Checkpoint ); précédent : 000C61; suivant : 000C63

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.

Auteurs : David Pellow [Israël] ; Darya Filippova [États-Unis] ; Carl Kingsford [États-Unis]

Source :

RBID : pubmed:27828710

Descripteurs français

English descriptors

Abstract

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.

DOI: 10.1089/cmb.2016.0155
PubMed: 27828710


Affiliations:


Links toward previous steps (curation, corpus...)


Links to Exploration step

pubmed:27828710

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</title>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</nlm:affiliation>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv</wicri:regionArea>
<wicri:noRegion>Tel Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation wicri:level="2">
<nlm:affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</nlm:affiliation>
<country>États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>2 Roche Sequencing Solutions , Pleasanton</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="2">
<nlm:affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh</wicri:cityArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2017">2017</date>
<idno type="RBID">pubmed:27828710</idno>
<idno type="pmid">27828710</idno>
<idno type="doi">10.1089/cmb.2016.0155</idno>
<idno type="wicri:Area/PubMed/Corpus">000E99</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000E99</idno>
<idno type="wicri:Area/PubMed/Curation">000E99</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000E99</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000C62</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000C62</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</title>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</nlm:affiliation>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv</wicri:regionArea>
<wicri:noRegion>Tel Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
<affiliation wicri:level="2">
<nlm:affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</nlm:affiliation>
<country>États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>2 Roche Sequencing Solutions , Pleasanton</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="2">
<nlm:affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh</wicri:cityArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Journal of computational biology : a journal of computational molecular cell biology</title>
<idno type="eISSN">1557-8666</idno>
<imprint>
<date when="2017" type="published">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology (methods)</term>
<term>Computer Simulation</term>
<term>Humans</term>
<term>Probability</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Biologie informatique ()</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Probabilité</term>
<term>Simulation numérique</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Computer Simulation</term>
<term>Humans</term>
<term>Probability</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Probabilité</term>
<term>Simulation numérique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">27828710</PMID>
<DateCompleted>
<Year>2017</Year>
<Month>11</Month>
<Day>16</Day>
</DateCompleted>
<DateRevised>
<Year>2020</Year>
<Month>04</Month>
<Day>15</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1557-8666</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>24</Volume>
<Issue>6</Issue>
<PubDate>
<Year>2017</Year>
<Month>Jun</Month>
</PubDate>
</JournalIssue>
<Title>Journal of computational biology : a journal of computational molecular cell biology</Title>
<ISOAbbreviation>J. Comput. Biol.</ISOAbbreviation>
</Journal>
<ArticleTitle>Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.</ArticleTitle>
<Pagination>
<MedlinePgn>547-557</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1089/cmb.2016.0155</ELocationID>
<Abstract>
<AbstractText>Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 - 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Pellow</LastName>
<ForeName>David</ForeName>
<Initials>D</Initials>
<AffiliationInfo>
<Affiliation>1 The Blavatnik School of Computer Science, Tel Aviv University , Tel Aviv, Israel .</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Filippova</LastName>
<ForeName>Darya</ForeName>
<Initials>D</Initials>
<AffiliationInfo>
<Affiliation>2 Roche Sequencing Solutions , Pleasanton, California.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Kingsford</LastName>
<ForeName>Carl</ForeName>
<Initials>C</Initials>
<AffiliationInfo>
<Affiliation>3 Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, Pennsylvania.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2016</Year>
<Month>11</Month>
<Day>09</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>J Comput Biol</MedlineTA>
<NlmUniqueID>9433358</NlmUniqueID>
<ISSNLinking>1066-5277</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003198" MajorTopicYN="N">Computer Simulation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011336" MajorTopicYN="N">Probability</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">Bloom fitters</Keyword>
<Keyword MajorTopicYN="N">efficient data structures</Keyword>
<Keyword MajorTopicYN="N">genomics</Keyword>
<Keyword MajorTopicYN="N">k-mers.</Keyword>
<Keyword MajorTopicYN="N">string algorithms</Keyword>
</KeywordList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2016</Year>
<Month>11</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2017</Year>
<Month>11</Month>
<Day>29</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2016</Year>
<Month>11</Month>
<Day>10</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">27828710</ArticleId>
<ArticleId IdType="doi">10.1089/cmb.2016.0155</ArticleId>
<ArticleId IdType="pmc">PMC5467106</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Genome Res. 2008 May;18(5):821-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18349386</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2010 Jul 1;26(13):1595-600</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20472541</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22847406</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Algorithms Mol Biol. 2013 Sep 16;8(1):22</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24040893</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 May 15;30(10):1354-62</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24451628</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Algorithms Mol Biol. 2014 Feb 24;9(1):2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24565280</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2014 Mar 03;15(3):R46</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24580807</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2014 May;32(5):462-4</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24752080</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2014;15 Suppl 9:S7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25252952</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2014;15(11):509</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">25398208</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2015 Sep 14;16:288</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">26370285</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2016 Mar;34(3):300-2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">26854477</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Res Comput Mol Biol. 2014 Apr;8394:385-399</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">28825060</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
<affiliations>
<list>
<country>
<li>Israël</li>
<li>États-Unis</li>
</country>
<region>
<li>Californie</li>
<li>Pennsylvanie</li>
</region>
</list>
<tree>
<country name="Israël">
<noRegion>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
</noRegion>
</country>
<country name="États-Unis">
<region name="Californie">
<name sortKey="Filippova, Darya" sort="Filippova, Darya" uniqKey="Filippova D" first="Darya" last="Filippova">Darya Filippova</name>
</region>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C62 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd -nk 000C62 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Checkpoint
   |type=    RBID
   |clé=     pubmed:27828710
   |texte=   Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Checkpoint/RBID.i   -Sk "pubmed:27828710" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021