Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Disk-based k-mer counting on a PC.

Identifieur interne : 001C93 ( PubMed/Corpus ); précédent : 001C92; suivant : 001C94

Disk-based k-mer counting on a PC.

Auteurs : Sebastian Deorowicz ; Agnieszka Debudaj-Grabysz ; Szymon Grabowski

Source :

RBID : pubmed:23679007

English descriptors

Abstract

The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

DOI: 10.1186/1471-2105-14-160
PubMed: 23679007

Links to Exploration step

pubmed:23679007

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Disk-based k-mer counting on a PC.</title>
<author>
<name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation>
<nlm:affiliation>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland. sebastian.deorowicz@polsl.pl</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
</author>
<author>
<name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23679007</idno>
<idno type="pmid">23679007</idno>
<idno type="doi">10.1186/1471-2105-14-160</idno>
<idno type="wicri:Area/PubMed/Corpus">001C93</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001C93</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Disk-based k-mer counting on a PC.</title>
<author>
<name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation>
<nlm:affiliation>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland. sebastian.deorowicz@polsl.pl</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
</author>
<author>
<name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</author>
</analytic>
<series>
<title level="j">BMC bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Caenorhabditis elegans (genetics)</term>
<term>Genome, Human</term>
<term>Genomics (methods)</term>
<term>Humans</term>
<term>Microcomputers</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Caenorhabditis elegans</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Genomics</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Microcomputers</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Status="MEDLINE" IndexingMethod="Curated" Owner="NLM">
<PMID Version="1">23679007</PMID>
<DateCompleted>
<Year>2013</Year>
<Month>11</Month>
<Day>05</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>02</Day>
</DateRevised>
<Article PubModel="Electronic">
<Journal>
<ISSN IssnType="Electronic">1471-2105</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>14</Volume>
<PubDate>
<Year>2013</Year>
<Month>May</Month>
<Day>16</Day>
</PubDate>
</JournalIssue>
<Title>BMC bioinformatics</Title>
<ISOAbbreviation>BMC Bioinformatics</ISOAbbreviation>
</Journal>
<ArticleTitle>Disk-based k-mer counting on a PC.</ArticleTitle>
<Pagination>
<MedlinePgn>160</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1186/1471-2105-14-160</ELocationID>
<Abstract>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.</AbstractText>
<AbstractText Label="RESULTS" NlmCategory="RESULTS">We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Deorowicz</LastName>
<ForeName>Sebastian</ForeName>
<Initials>S</Initials>
<AffiliationInfo>
<Affiliation>Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland. sebastian.deorowicz@polsl.pl</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Debudaj-Grabysz</LastName>
<ForeName>Agnieszka</ForeName>
<Initials>A</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Grabowski</LastName>
<ForeName>Szymon</ForeName>
<Initials>S</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2013</Year>
<Month>05</Month>
<Day>16</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>BMC Bioinformatics</MedlineTA>
<NlmUniqueID>100965194</NlmUniqueID>
<ISSNLinking>1471-2105</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000465" MajorTopicYN="Y">Algorithms</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017173" MajorTopicYN="N">Caenorhabditis elegans</DescriptorName>
<QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015894" MajorTopicYN="N">Genome, Human</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D023281" MajorTopicYN="N">Genomics</DescriptorName>
<QualifierName UI="Q000379" MajorTopicYN="Y">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008838" MajorTopicYN="Y">Microcomputers</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D016415" MajorTopicYN="N">Sequence Alignment</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D017422" MajorTopicYN="N">Sequence Analysis, DNA</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="received">
<Year>2012</Year>
<Month>11</Month>
<Day>29</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="accepted">
<Year>2013</Year>
<Month>05</Month>
<Day>03</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2013</Year>
<Month>5</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2013</Year>
<Month>5</Month>
<Day>18</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2013</Year>
<Month>11</Month>
<Day>6</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>epublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">23679007</ArticleId>
<ArticleId IdType="pii">1471-2105-14-160</ArticleId>
<ArticleId IdType="doi">10.1186/1471-2105-14-160</ArticleId>
<ArticleId IdType="pmc">PMC3680041</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Nucleic Acids Res. 2004;32(5):1792-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15034147</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2008 Dec 15;24(24):2818-24</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18952627</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Genomics. 2008;9:517</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">18976482</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Genome Biol. 2010;11(11):R116</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21114842</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2013 Mar 1;29(5):652-3</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">23325618</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>BMC Bioinformatics. 2011;12:333</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21831268</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Nat Biotechnol. 2011 Nov;29(11):987-91</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22068540</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">22847406</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Bioinformatics. 2011 Mar 15;27(6):764-70</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">21217122</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</pubmed>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001C93 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 001C93 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:23679007
   |texte=   Disk-based k-mer counting on a PC.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:23679007" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021