Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

ntCard: a streaming algorithm for cardinality estimation in genomics data

Identifieur interne : 000667 ( Pmc/Checkpoint ); précédent : 000666; suivant : 000668

ntCard: a streaming algorithm for cardinality estimation in genomics data

Auteurs : Hamid Mohamadi [Canada] ; Hamza Khan [Canada] ; Inanc Birol [Canada]

Source :

RBID : PMC:5408799

Abstract

AbstractMotivation

Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task.

Results

Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.

Availability and Implementation

ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard.

Supplementary information

Supplementary data are available at Bioinformatics online.


Url:
DOI: 10.1093/bioinformatics/btw832
PubMed: 28453674
PubMed Central: 5408799


Affiliations:


Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:5408799

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ntCard: a streaming algorithm for cardinality estimation in genomics data</title>
<author>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Khan, Hamza" sort="Khan, Hamza" uniqKey="Khan H" first="Hamza" last="Khan">Hamza Khan</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanc" last="Birol">Inanc Birol</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28453674</idno>
<idno type="pmc">5408799</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408799</idno>
<idno type="RBID">PMC:5408799</idno>
<idno type="doi">10.1093/bioinformatics/btw832</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000B17</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B17</idno>
<idno type="wicri:Area/Pmc/Curation">000B17</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000B17</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000667</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000667</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">ntCard: a streaming algorithm for cardinality estimation in genomics data</title>
<author>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Khan, Hamza" sort="Khan, Hamza" uniqKey="Khan H" first="Hamza" last="Khan">Hamza Khan</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanc" last="Birol">Inanc Birol</name>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff1">Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="btw832-aff2">Faculty of Science, University of British Columbia, Vancouver, BC, Canada</nlm:aff>
<country xml:lang="fr">Canada</country>
<wicri:regionArea>Faculty of Science, University of British Columbia, Vancouver, BC</wicri:regionArea>
<wicri:noRegion>BC</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as
<italic>k</italic>
-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique
<italic>k</italic>
-mers, or even better, to build a histogram of
<italic>k</italic>
-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a
<italic>k</italic>
-mer histogram from large volumes of sequencing data is a challenging task.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>Here, we present ntCard, a streaming algorithm for estimating the frequencies of
<italic>k</italic>
-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates
<italic>k</italic>
-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.</p>
</sec>
<sec id="SA3">
<title>Availability and Implementation</title>
<p>ntCard is written in C ++ and is released under the GPL license. It is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/bcgsc/ntCard">https://github.com/bcgsc/ntCard</ext-link>
.</p>
</sec>
<sec id="SA4">
<title>Supplementary information</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available at
<italic>Bioinformatics</italic>
online.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Alon, N" uniqKey="Alon N">N. Alon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bar Yossef, Z" uniqKey="Bar Yossef Z">Z. Bar-Yossef</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Butler, J" uniqKey="Butler J">J. Butler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R. Chikhi</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chu, J" uniqKey="Chu J">J. Chu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Conway, T C" uniqKey="Conway T">T.C. Conway</name>
</author>
<author>
<name sortKey="Bromage, A J" uniqKey="Bromage A">A.J. Bromage</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormode, G" uniqKey="Cormode G">G. Cormode</name>
</author>
<author>
<name sortKey="Garofalakis, M" uniqKey="Garofalakis M">M. Garofalakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cormode, G" uniqKey="Cormode G">G. Cormode</name>
</author>
<author>
<name sortKey="Muthukrishnan, S" uniqKey="Muthukrishnan S">S. Muthukrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S. Deorowicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, R C" uniqKey="Edgar R">R.C. Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flajolet, P" uniqKey="Flajolet P">P. Flajolet</name>
</author>
<author>
<name sortKey="Martin, G N" uniqKey="Martin G">G.N. Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heo, Y" uniqKey="Heo Y">Y. Heo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Indyk, P" uniqKey="Indyk P">P. Indyk</name>
</author>
<author>
<name sortKey="Woodruff, D" uniqKey="Woodruff D">D. Woodruff</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Irber Junior, L C" uniqKey="Irber Junior L">L.C. Irber Junior</name>
</author>
<author>
<name sortKey="Brown, C T" uniqKey="Brown C">C.T. Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jackman, S D" uniqKey="Jackman S">S.D. Jackman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, R" uniqKey="Li R">R. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C. Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P. Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P. Melsted</name>
</author>
<author>
<name sortKey="Halld Rsson, B V" uniqKey="Halld Rsson B">B.V. Halldórsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P. Melsted</name>
</author>
<author>
<name sortKey="Pritchard, J K" uniqKey="Pritchard J">J.K. Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohamadi, H" uniqKey="Mohamadi H">H. Mohamadi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nattestad, M" uniqKey="Nattestad M">M. Nattestad</name>
</author>
<author>
<name sortKey="Schatz, M C" uniqKey="Schatz M">M.C. Schatz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Patro, R" uniqKey="Patro R">R. Patro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G. Rizk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shajii, A" uniqKey="Shajii A">A. Shajii</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J T" uniqKey="Simpson J">J.T. Simpson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Simpson, J T" uniqKey="Simpson J">J.T. Simpson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warren, R L" uniqKey="Warren R">R.L. Warren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, D R" uniqKey="Zerbino D">D.R. Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E. Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zook, J M" uniqKey="Zook J">J.M. Zook</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28453674</article-id>
<article-id pub-id-type="pmc">5408799</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btw832</article-id>
<article-id pub-id-type="publisher-id">btw832</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Papers</subject>
<subj-group subj-group-type="category-toc-heading">
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>ntCard: a streaming algorithm for cardinality estimation in genomics data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Mohamadi</surname>
<given-names>Hamid</given-names>
</name>
<xref ref-type="aff" rid="btw832-aff1">1</xref>
<xref ref-type="aff" rid="btw832-aff2">2</xref>
<xref ref-type="corresp" rid="btw832-cor1"></xref>
<pmc-comment>hmohamadi@bcgsc.ca</pmc-comment>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Khan</surname>
<given-names>Hamza</given-names>
</name>
<xref ref-type="aff" rid="btw832-aff1">1</xref>
<xref ref-type="aff" rid="btw832-aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Birol</surname>
<given-names>Inanc</given-names>
</name>
<xref ref-type="aff" rid="btw832-aff1">1</xref>
<xref ref-type="aff" rid="btw832-aff2">2</xref>
<xref ref-type="corresp" rid="btw832-cor1"></xref>
<pmc-comment>ibirol@bcgsc.ca</pmc-comment>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Berger</surname>
<given-names>Bonnie</given-names>
</name>
<role>Associate Editor</role>
</contrib>
</contrib-group>
<aff id="btw832-aff1">
<label>1</label>
Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada</aff>
<aff id="btw832-aff2">
<label>2</label>
Faculty of Science, University of British Columbia, Vancouver, BC, Canada</aff>
<author-notes>
<corresp id="btw832-cor1">To whom correspondence should be addressed. Email:
<email>hmohamadi@bcgsc.ca</email>
or
<email>ibirol@bcgsc.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>01</day>
<month>5</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub" iso-8601-date="2017-01-05">
<day>05</day>
<month>1</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>05</day>
<month>1</month>
<year>2017</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>33</volume>
<issue>9</issue>
<fpage>1324</fpage>
<lpage>1330</lpage>
<history>
<date date-type="received">
<day>31</day>
<month>10</month>
<year>2016</year>
</date>
<date date-type="rev-recd">
<day>21</day>
<month>12</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>12</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author 2017. Published by Oxford University Press.</copyright-statement>
<copyright-year>2017</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by-nc/4.0/" license-type="cc-by-nc">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com</license-p>
</license>
</permissions>
<self-uri xlink:href="btw832.pdf"></self-uri>
<abstract>
<title>Abstract</title>
<sec id="SA1">
<title>Motivation</title>
<p>Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as
<italic>k</italic>
-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique
<italic>k</italic>
-mers, or even better, to build a histogram of
<italic>k</italic>
-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a
<italic>k</italic>
-mer histogram from large volumes of sequencing data is a challenging task.</p>
</sec>
<sec id="SA2">
<title>Results</title>
<p>Here, we present ntCard, a streaming algorithm for estimating the frequencies of
<italic>k</italic>
-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates
<italic>k</italic>
-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.</p>
</sec>
<sec id="SA3">
<title>Availability and Implementation</title>
<p>ntCard is written in C ++ and is released under the GPL license. It is freely available at
<ext-link ext-link-type="uri" xlink:href="https://github.com/bcgsc/ntCard">https://github.com/bcgsc/ntCard</ext-link>
.</p>
</sec>
<sec id="SA4">
<title>Supplementary information</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available at
<italic>Bioinformatics</italic>
online.</p>
</sec>
</abstract>
<funding-group>
<award-group award-type="grant">
<funding-source>
<named-content content-type="funder-name">National Institutes of Health</named-content>
<named-content content-type="funder-identifier">10.13039/100000002</named-content>
</funding-source>
<award-id>R01HG007182</award-id>
</award-group>
</funding-group>
<counts>
<page-count count="7"></page-count>
</counts>
</article-meta>
</front>
</pmc>
<affiliations>
<list>
<country>
<li>Canada</li>
</country>
</list>
<tree>
<country name="Canada">
<noRegion>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
</noRegion>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanc" last="Birol">Inanc Birol</name>
<name sortKey="Birol, Inanc" sort="Birol, Inanc" uniqKey="Birol I" first="Inanc" last="Birol">Inanc Birol</name>
<name sortKey="Khan, Hamza" sort="Khan, Hamza" uniqKey="Khan H" first="Hamza" last="Khan">Hamza Khan</name>
<name sortKey="Khan, Hamza" sort="Khan, Hamza" uniqKey="Khan H" first="Hamza" last="Khan">Hamza Khan</name>
<name sortKey="Mohamadi, Hamid" sort="Mohamadi, Hamid" uniqKey="Mohamadi H" first="Hamid" last="Mohamadi">Hamid Mohamadi</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000667 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd -nk 000667 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Checkpoint
   |type=    RBID
   |clé=     PMC:5408799
   |texte=   ntCard: a streaming algorithm for cardinality estimation in genomics data
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/RBID.i   -Sk "pubmed:28453674" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021