MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information it contains has therefore not been validated.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Internal identifier: 001899 (PubMed/Corpus); previous: 001898; next: 001900

Authors: Qingpeng Zhang; Jason Pell; Rosangela Canino-Koning; Adina Chuang Howe; C Titus Brown

Source: PLoS ONE, 2014, 9(7), e101271

RBID: pubmed:25062443

English descriptors

Abstract

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets, this data structure is considerably more memory-efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
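
The idea behind the Count-Min Sketch described in the abstract can be illustrated with a short Python sketch. This is not khmer's implementation or API: the class `CountMinSketch`, the helpers `kmers` and `count_kmers`, the salted BLAKE2b hashing, and the default table sizes are all assumptions chosen for illustration, and reverse-complement handling is left out. It shows the two properties the abstract names: counts can be updated and queried online, and the reported count can exceed the true count but never fall below it.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: `depth` tables of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per table, obtained by salting BLAKE2b with the table index.
        # (Assumed hashing scheme; any family of independent hashes would do.)
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # Online update: increment one counter in every table.
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest available estimate (never an undercount).
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(reads, k, width=1024, depth=4):
    """Stream reads into a sketch; only counters are stored, not the k-mers."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch
```

Calling `count_kmers(["ACGTACGT", "ACGTTTTT"], k=4).count("ACGT")` returns a value of at least 3, the true number of occurrences; how far above 3 it can drift depends on how full the tables are, which is the memory-versus-miscount trade-off the abstract refers to.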

DOI: 10.1371/journal.pone.0101271
PubMed: 25062443

Links to Exploration step

pubmed:25062443

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.</title>
<author>
<name sortKey="Zhang, Qingpeng" sort="Zhang, Qingpeng" uniqKey="Zhang Q" first="Qingpeng" last="Zhang">Qingpeng Zhang</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Pell, Jason" sort="Pell, Jason" uniqKey="Pell J" first="Jason" last="Pell">Jason Pell</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Canino Koning, Rosangela" sort="Canino Koning, Rosangela" uniqKey="Canino Koning R" first="Rosangela" last="Canino-Koning">Rosangela Canino-Koning</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Howe, Adina Chuang" sort="Howe, Adina Chuang" uniqKey="Howe A" first="Adina Chuang" last="Howe">Adina Chuang Howe</name>
<affiliation>
<nlm:affiliation>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America; Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Brown, C Titus" sort="Brown, C Titus" uniqKey="Brown C" first="C Titus" last="Brown">C Titus Brown</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America; Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2014">2014</date>
<idno type="RBID">pubmed:25062443</idno>
<idno type="pmid">25062443</idno>
<idno type="doi">10.1371/journal.pone.0101271</idno>
<idno type="wicri:Area/PubMed/Corpus">001899</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001899</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.</title>
<author>
<name sortKey="Zhang, Qingpeng" sort="Zhang, Qingpeng" uniqKey="Zhang Q" first="Qingpeng" last="Zhang">Qingpeng Zhang</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Pell, Jason" sort="Pell, Jason" uniqKey="Pell J" first="Jason" last="Pell">Jason Pell</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Canino Koning, Rosangela" sort="Canino Koning, Rosangela" uniqKey="Canino Koning R" first="Rosangela" last="Canino-Koning">Rosangela Canino-Koning</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Howe, Adina Chuang" sort="Howe, Adina Chuang" uniqKey="Howe A" first="Adina Chuang" last="Howe">Adina Chuang Howe</name>
<affiliation>
<nlm:affiliation>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America; Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Brown, C Titus" sort="Brown, C Titus" uniqKey="Brown C" first="C Titus" last="Brown">C Titus Brown</name>
<affiliation>
<nlm:affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America; Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America.</nlm:affiliation>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PloS one</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2014" type="published">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology</term>
<term>Humans</term>
<term>Nucleotides</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" type="chemical" xml:lang="en">
<term>Nucleotides</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Computational Biology</term>
<term>Humans</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer. </div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">25062443</PMID>
<DateCompleted>
<Year>2015</Year>
<Month>04</Month>
<Day>16</Day>
</DateCompleted>
<DateRevised>
<Year>2019</Year>
<Month>02</Month>
<Day>23</Day>
</DateRevised>
<Article PubModel="Electronic-eCollection">
<Journal>
<ISSN IssnType="Electronic">1932-6203</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>9</Volume>
<Issue>7</Issue>
<PubDate>
<Year>2014</Year>
</PubDate>
</JournalIssue>
<Title>PloS one</Title>
<ISOAbbreviation>PLoS ONE</ISOAbbreviation>
</Journal>
<ArticleTitle>These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.</ArticleTitle>
<Pagination>
<MedlinePgn>e101271</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1371/journal.pone.0101271</ELocationID>
<Abstract>
<AbstractText>K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer. </AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Zhang</LastName>
<ForeName>Qingpeng</ForeName>
<Initials>Q</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Pell</LastName>
<ForeName>Jason</ForeName>
<Initials>J</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Canino-Koning</LastName>
<ForeName>Rosangela</ForeName>
<Initials>R</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Howe</LastName>
<ForeName>Adina Chuang</ForeName>
<Initials>AC</Initials>
<AffiliationInfo>
<Affiliation>Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America; Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Brown</LastName>
<ForeName>C Titus</ForeName>
<Initials>CT</Initials>
<AffiliationInfo>
<Affiliation>Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America; Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<GrantList CompleteYN="Y">
<Grant>
<GrantID>R01 HG007513</GrantID>
<Acronym>HG</Acronym>
<Agency>NHGRI NIH HHS</Agency>
<Country>United States</Country>
</Grant>
<Grant>
<GrantID>1R01HG007513-01</GrantID>
<Acronym>HG</Acronym>
<Agency>NHGRI NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D052061">Research Support, N.I.H., Extramural</PublicationType>
<PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2014</Year>
<Month>07</Month>
<Day>25</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>PLoS One</MedlineTA>
<NlmUniqueID>101285081</NlmUniqueID>
<ISSNLinking>1932-6203</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D009711">Nucleotides</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D019295" MajorTopicYN="Y">Computational Biology</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D009711" MajorTopicYN="Y">Nucleotides</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="Y">Sequence Analysis, DNA</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="Y">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2013</Year>
<Month>10</Month>
<Day>14</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2014</Year>
<Month>06</Month>
<Day>04</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2014</Year>
<Month>7</Month>
<Day>26</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2014</Year>
<Month>7</Month>
<Day>26</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2015</Year>
<Month>4</Month>
<Day>17</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">25062443</ArticleId>
<ArticleId IdType="doi">10.1371/journal.pone.0101271</ArticleId>
<ArticleId IdType="pii">PONE-D-13-45384</ArticleId>
<ArticleId IdType="pmc">PMC4111482</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Genome Biol. 2010;11(11):R116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21114842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>F1000Res. 2015 Sep 25;4:900</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">26535114</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2003 Aug;13(8):1916-22</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">12902383</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nucleic Acids Res. 2012 Dec;40(22):e171</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22904078</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jul 15;30(14):1950-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24618471</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Res. 2008 May;18(5):821-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18349386</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Rev Genet. 2010 Jan;11(1):31-46</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19997069</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2011 Oct;29(10):915-21</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21926975</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Protoc. 2013 Aug;8(8):1494-512</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23845962</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Feb 15;27(4):479-86</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21245053</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jul 15;30(14):2070-2</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24642064</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2008;9:517</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18976482</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2013 Mar 1;29(5):652-3</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23325618</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2011;12(11):R112</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22067484</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nature. 2010 Mar 4;464(7285):59-65</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">20203603</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2013;14:160</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23679007</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">11504945</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Jul 1;27(13):i137-41</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21685062</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2011;12:333</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21831268</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2014 Jan 1;30(1):31-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23732276</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2014 Apr 1;111(13):4904-9</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">24632729</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22847406</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2009;10:161</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">19473525</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>
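
As a rough illustration of how the fields of the record above could be read programmatically, the following Python sketch extracts the PMID, title, and author names from the `<pubmed>` part using the standard library. The `SAMPLE` string is an abridged copy of that part (two of the five authors, most elements omitted), and the function name `summarize` is hypothetical. The TEI half of the record is deliberately left out: as displayed, it uses the `nlm:` and `wicri:` prefixes without namespace declarations, so a namespace-aware parser would likely reject it unless those declarations are supplied.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the <pubmed> element shown above (illustration only).
SAMPLE = """\
<pubmed>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">25062443</PMID>
<Article PubModel="Electronic-eCollection">
<ArticleTitle>These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.</ArticleTitle>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Zhang</LastName>
<ForeName>Qingpeng</ForeName>
</Author>
<Author ValidYN="Y">
<LastName>Brown</LastName>
<ForeName>C Titus</ForeName>
</Author>
</AuthorList>
</Article>
</MedlineCitation>
</pubmed>
"""


def summarize(pubmed_xml):
    """Return the PMID, title, and author names found in a <pubmed> element."""
    root = ET.fromstring(pubmed_xml)
    citation = root.find("MedlineCitation")
    authors = [
        f"{author.findtext('ForeName')} {author.findtext('LastName')}"
        for author in citation.iterfind("Article/AuthorList/Author")
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle"),
        "authors": authors,
    }
```

Calling `summarize(SAMPLE)` yields a dictionary with the keys `pmid`, `title`, and `authors`; the same function should work on the full `<pubmed>` element once it has been separated from the TEI part.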

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001899 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 001899 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:25062443
   |texte=   These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:25062443" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021