MERS exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Efficient counting of k-mers in DNA sequences using a bloom filter.

Internal identifier: 001E52 (PubMed/Curation); previous: 001E51; next: 001E53

Authors: Páll Melsted [United States]; Jonathan K. Pritchard

Source: BMC Bioinformatics, 2011; 12: 333

RBID : pubmed:21831268

Abstract

Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.
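
The RESULTS part of the abstract, found in the XML record further down this page, describes a two-pass approach: a Bloom filter is used to identify the k-mers that occur more than once, and a second sweep through the data then provides exact counts for those nonunique k-mers. The following is a minimal Python sketch of that idea under stated assumptions; it is not BFCounter (which the abstract says is written in C++). The names `BloomFilter`, `kmers` and `count_nonunique_kmers`, the SHA-256-based hashing, and the default filter size are all hypothetical choices made for illustration, and reverse complements are ignored.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash positions (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive all positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_nonunique_kmers(reads, k, num_bits=1 << 20, num_hashes=4):
    """Return exact counts of k-mers seen more than once.

    `reads` must be re-iterable (for example a list), because it is scanned twice.
    """
    seen = BloomFilter(num_bits, num_hashes)
    candidates = set()

    # Pass 1: a k-mer the filter already reports as present is a candidate repeat.
    # Singletons are only recorded implicitly as bits in the filter.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen:
                candidates.add(kmer)
            else:
                seen.add(kmer)

    # Pass 2: exact counts, restricted to the candidate set.
    counts = dict.fromkeys(candidates, 0)
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1

    # Bloom filter false positives can admit true singletons; drop them here.
    return {kmer: n for kmer, n in counts.items() if n > 1}
```

For example, `count_nonunique_kmers(["ACGTACGT", "TTTT"], k=4)` returns `{"ACGT": 2}`: only `ACGT` occurs twice, and the singleton k-mers never enter the exact-count table. Because a Bloom filter has no false negatives, every repeated k-mer becomes a candidate; the final filtering step in this sketch removes any singleton admitted by a false positive.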

DOI: 10.1186/1471-2105-12-333
PubMed: 21831268

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Efficient counting of k-mers in DNA sequences using a bloom filter.</title>
<author>
<name sortKey="Melsted, Pall" sort="Melsted, Pall" uniqKey="Melsted P" first="Páll" last="Melsted">Páll Melsted</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA. pmelsted@gmail.com</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Human Genetics, The University of Chicago, Chicago, IL 60637</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pritchard, Jonathan K" sort="Pritchard, Jonathan K" uniqKey="Pritchard J" first="Jonathan K" last="Pritchard">Jonathan K. Pritchard</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2011">2011</date>
<idno type="RBID">pubmed:21831268</idno>
<idno type="pmid">21831268</idno>
<idno type="doi">10.1186/1471-2105-12-333</idno>
<idno type="wicri:Area/PubMed/Corpus">001E52</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001E52</idno>
<idno type="wicri:Area/PubMed/Curation">001E52</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001E52</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Efficient counting of k-mers in DNA sequences using a bloom filter.</title>
<author>
<name sortKey="Melsted, Pall" sort="Melsted, Pall" uniqKey="Melsted P" first="Páll" last="Melsted">Páll Melsted</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA. pmelsted@gmail.com</nlm:affiliation>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Human Genetics, The University of Chicago, Chicago, IL 60637</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Pritchard, Jonathan K" sort="Pritchard, Jonathan K" uniqKey="Pritchard J" first="Jonathan K" last="Pritchard">Jonathan K. Pritchard</name>
</author>
</analytic>
<series>
<title level="j">BMC bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2011" type="published">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology (instrumentation)</term>
<term>Computational Biology (methods)</term>
<term>Computers</term>
<term>HapMap Project</term>
<term>Humans</term>
<term>Probability</term>
<term>Sequence Analysis, DNA (instrumentation)</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Analyse de séquence d'ADN (instrumentation)</term>
<term>Biologie informatique ()</term>
<term>Biologie informatique (instrumentation)</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Ordinateurs</term>
<term>Probabilité</term>
<term>Projet HapMap</term>
</keywords>
<keywords scheme="MESH" qualifier="instrumentation" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Computers</term>
<term>HapMap Project</term>
<term>Humans</term>
<term>Probability</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" qualifier="instrumentation" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Biologie informatique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Ordinateurs</term>
<term>Probabilité</term>
<term>Projet HapMap</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">21831268</PMID>
<DateCompleted>
<Year>2011</Year>
<Month>12</Month>
<Day>29</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>01</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1471-2105</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>12</Volume>
<PubDate>
<Year>2011</Year>
<Month>Aug</Month>
<Day>10</Day>
</PubDate>
</JournalIssue>
<Title>BMC bioinformatics</Title>
<ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>Efficient counting of k-mers in DNA sequences using a bloom filter.</ArticleTitle>
<Pagination>
<MedlinePgn>333</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1186/1471-2105-12-333</ELocationID>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Melsted</LastName>
<ForeName>Páll</ForeName>
<Initials>P</Initials>
<AffiliationInfo>
<Affiliation>Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA. pmelsted@gmail.com</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Pritchard</LastName>
<ForeName>Jonathan K</ForeName>
<Initials>JK</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<GrantList CompleteYN="Y">
<Grant>
<GrantID>MH084703</GrantID>
<Acronym>MH</Acronym>
<Agency>NIMH NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<Agency>Howard Hughes Medical Institute</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D052061">Research Support, N.I.H., Extramural</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2011</Year>
<Month>08</Month>
<Day>10</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>BMC Bioinformatics</MedlineTA>
<NlmUniqueID>100965194</NlmUniqueID>
<ISSNLinking>1471-2105</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="N">Computational Biology</DescriptorName>
<QualifierName UI="Q000295" MajorTopicYN="N">instrumentation</QualifierName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D003201" MajorTopicYN="N">Computers</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D060148" MajorTopicYN="N">HapMap Project</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011336" MajorTopicYN="Y">Probability</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
<QualifierName UI="Q000295" MajorTopicYN="N">instrumentation</QualifierName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2011</Year>
<Month>05</Month>
<Day>14</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2011</Year>
<Month>08</Month>
<Day>10</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2011</Year>
<Month>8</Month>
<Day>12</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2011</Year>
<Month>8</Month>
<Day>13</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2011</Year>
<Month>12</Month>
<Day>30</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">21831268</ArticleId>
<ArticleId IdType="pii">1471-2105-12-333</ArticleId>
<ArticleId IdType="doi">10.1186/1471-2105-12-333</ArticleId>
<ArticleId IdType="pmc">PMC3166945</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Genome Biol. 2010;11(11):R116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21114842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J VLSI Signal Process Syst Signal Image Video Technol. 2007;49(1):101-121</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18846267</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2008 May;18(5):810-20</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18340039</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2010 Jan 21;463(7279):311-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20010809</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>PLoS One. 2008;3(10):e3376</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18852878</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2008 May;18(5):821-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18349386</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Feb 15;27(4):479-86</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21245053</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2011 Jan 25;108(4):1513-8</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21187386</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2011 Apr;21(4):610-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21233398</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2009 Jun;19(6):1117-23</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19251739</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>J Comput Biol. 2010 Apr;17(4):603-15</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20426693</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11504945</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2010 Jul 1;26(13):1595-600</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20472541</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2010 Oct 28;467(7319):1061-73</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20981092</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2010 Feb;20(2):265-72</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20019144</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001E52 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 001E52 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:21831268
   |texte=   Efficient counting of k-mers in DNA sequences using a bloom filter.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:21831268" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021